End-to-End Autonomous Driving: Challenges and Frontiers
(Survey paper)

Abstract—The autonomous driving community has witnessed a rapid growth in approaches that embrace an end-to-end algorithm framework, utilizing raw sensor input to generate vehicle motion plans, instead of concentrating on individual tasks such as detection and motion prediction. End-to-end systems, in comparison to modular pipelines, benefit from joint feature optimization for perception and planning. This field has flourished due to the availability of large-scale datasets, closed-loop evaluation, and the increasing need for autonomous driving algorithms to perform effectively in challenging scenarios. In this survey, we provide a comprehensive analysis of more than 270 papers, covering the motivation, roadmap, methodology, challenges, and future trends in end-to-end autonomous driving. We delve into several critical challenges, including multi-modality, interpretability, causal confusion, robustness, and world models, amongst others. Additionally, we discuss current advancements in foundation models and visual pre-training, as well as how to incorporate these techniques within the end-to-end driving framework.

Index Terms—Autonomous driving, end-to-end system design, policy learning, simulation.

I. INTRODUCTION

CONVENTIONAL autonomous driving systems adopt a modular design strategy, wherein each functionality, such as perception, prediction, and planning, is individually developed and integrated into onboard vehicles. The planning or control module, responsible for generating steering and acceleration outputs, plays a crucial role in determining the driving experience. The most common approach for planning in modular pipelines involves using sophisticated rule-based designs, which are often ineffective in addressing the vast number of situations that occur on road. Therefore, there is a growing trend to leverage large-scale data and to use learning-based planning as a viable alternative.

We define end-to-end autonomous driving systems as fully differentiable programs that take raw sensor data as input and produce a plan and/or low-level control actions as output. Fig. 1(a)-(b) illustrates the difference between the classical and end-to-end formulation. The conventional approach feeds the output of each component, such as bounding boxes and vehicle trajectories, directly into subsequent units (dashed arrows). In contrast, the end-to-end paradigm propagates feature representations across components (gray solid arrow). The optimized function is set to be, for example, the planning performance, and the loss is minimized via back-propagation (red arrow). Tasks are jointly and globally optimized in this process.
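To make the formulation above concrete, the following is a minimal sketch of such a fully differentiable pipeline: a shared encoder maps a raw camera frame to a feature that a planning head decodes into waypoints, and a single planning loss is back-propagated through both parts. The architecture, shapes, and training details are illustrative assumptions for exposition, not the design of any specific system surveyed here.

```python
# Minimal sketch of the end-to-end formulation: raw sensor input -> shared
# feature -> planned waypoints, with one planning loss back-propagated through
# the whole stack. All shapes and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class EndToEndPlanner(nn.Module):
    def __init__(self, num_waypoints: int = 4):
        super().__init__()
        self.num_waypoints = num_waypoints
        # Perception backbone: encodes a camera frame into a feature vector
        # (the feature propagation shown as the gray solid arrow in Fig. 1).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Planning head: decodes the feature into future (x, y) waypoints.
        self.planner = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_waypoints * 2),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feature = self.encoder(image)
        return self.planner(feature).view(-1, self.num_waypoints, 2)

model = EndToEndPlanner()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch standing in for recorded camera frames and expert trajectories.
image = torch.randn(8, 3, 128, 256)
expert_waypoints = torch.randn(8, 4, 2)

optimizer.zero_grad()
pred = model(image)
loss = nn.functional.l1_loss(pred, expert_waypoints)  # single planning objective
loss.backward()                                       # joint, global optimization
optimizer.step()
```

In a full system the encoder would fuse multiple sensors and the head could emit control signals instead of waypoints; the point here is only that every component sits in one computation graph optimized for the planning objective.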
In this survey, we conduct an extensive review of this emerging topic. Fig. 1 provides an overview of our work. We begin by discussing the motivation and roadmap for end-to-end autonomous driving systems. End-to-end approaches can be broadly classified into imitation and reinforcement learning, and we give a brief review of these methodologies. We cover datasets and benchmarks for both closed and open-loop evaluation. We summarize a series of critical challenges, including interpretability, generalization, world models, causal confusion, etc. We conclude by discussing future trends that we think should be embraced by the community to incorporate the latest developments from data engines and large foundation models, amongst others. Note that this review is mainly orchestrated from a theoretical perspective. Engineering efforts such as version control, unit testing, data servers, data cleaning, software-hardware co-design, etc., play crucial roles in deploying the end-to-end technology. Publicly available information regarding the latest practices on these topics is limited. We invite the community towards more openness in future discussions.

We maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving.
Fig. 1. Survey at a Glance. (a) Pipeline and Methods. We define end-to-end autonomous driving as a learning-based algorithm framework with raw sensor input and planning/control output. We dive deep into 270+ papers and categorize them into imitation learning (IL) and reinforcement learning (RL). (b) Benchmarking. We group popular benchmarks into closed-loop and open-loop evaluation, respectively. We cover various aspects of closed-loop simulation and the limitations of open-loop evaluation for this problem. (c) Challenges. This is the main section of our work. We list key challenges from a wide range of topics and extensively analyze why these concerns are crucial. Promising resolutions to these challenges are covered as well. (d) Future Trends. We discuss how the end-to-end paradigm could benefit from the rapid development of foundation models, visual pre-training, etc. Some photos are courtesy of online resources.
A. Motivation of an End-to-End System

In the classical pipeline, each model serves a standalone component and corresponds to a specific task (e.g., traffic light detection). Such a design is beneficial in terms of interpretability and ease of debugging. However, since the optimization objectives across modules are different, with detection pursuing mean average precision (mAP) while planning aiming for driving safety and comfort, the entire system may not be aligned with a unified target, i.e., the ultimate planning/control task. Errors from each module, as the sequential procedure proceeds, could be compounded and result in an information loss. Moreover, compared to one end-to-end neural network, the multi-task, multi-model deployment, which involves multiple encoders and message transmission systems, may increase the computational burden and potentially lead to sub-optimal use of compute.

In contrast to its classical counterpart, an end-to-end autonomous system offers several advantages. (a) The most apparent merit is its simplicity in combining perception, prediction, and planning into a single model that can be jointly trained. (b) The whole system, including its intermediate representations, is optimized towards the ultimate task. (c) Shared backbones increase computational efficiency. (d) Data-driven optimization has the potential to improve the system by simply scaling training resources.

Note that the end-to-end paradigm does not necessarily indicate one black box with only planning/control outputs. It could have intermediate representations and outputs (Fig. 1(b)) as in classical approaches. In fact, several state-of-the-art systems [1], [2] propose a modular design but optimize all components together to achieve superior performance.

B. Roadmap

Fig. 2 depicts a chronological roadmap of critical achievements in end-to-end autonomous driving, where each part indicates an essential paradigm shift or performance boost. The history of end-to-end autonomous driving dates back to 1988 with ALVINN [3], where the input was two "retinas" from a camera and a laser range finder, and a simple neural network generated steering output. NVIDIA designed a prototype end-to-end CNN system, which reestablished this idea in the new era of GPU computing [8]. Notable progress has been achieved with the development of deep neural networks, both in imitation learning [15], [16] and reinforcement learning [4], [17], [18], [19]. The policy distillation paradigm proposed in LBC [5] and related approaches [20], [21], [22], [23] has significantly improved closed-loop performance by mimicking a well-behaved expert. To enhance generalization ability due to the discrepancy between the expert and learned policy, several papers [10], [24], [25] have proposed aggregating on-policy data [26] during training.
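The on-policy data aggregation referenced above follows the DAgger recipe [26]: roll out the current policy, let the expert relabel the states it actually visits, and retrain on the growing dataset. The sketch below is schematic; `env`, `expert`, and `train_policy` are hypothetical interfaces, not APIs from the cited works.

```python
# Schematic DAgger-style on-policy data aggregation. The `env`, `expert`, and
# `train_policy` objects are hypothetical placeholders for a simulator, a
# privileged/expert driver, and a supervised learning routine.
def dagger(env, expert, train_policy, num_iters=5, horizon=1000):
    dataset = list(expert.demonstrations())         # seed with expert demonstrations
    policy = train_policy(dataset)                  # behavior cloning on the seed set
    for _ in range(num_iters):
        state = env.reset()
        for _ in range(horizon):
            action = policy(state)                  # drive with the *current* policy
            dataset.append((state, expert(state)))  # expert labels the visited state
            state, done = env.step(action)
            if done:
                break
        policy = train_policy(dataset)              # retrain on the aggregated dataset
    return policy
```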
Fig. 2. Roadmap of End-to-end Autonomous Driving. We present the key milestones chronologically, grouping similar works under the same theme. The
representative or first work is shown in bold with an illustration, while the date of the rest of the literature in the same theme may vary. We also display the score
for each year’s top entry in the CARLA leaderboard [13] (DS, ranging from 0 to 100) and the recent nuPlan challenge [14] (Score ranging from 0 to 1).
A significant turning point occurred around 2021. With diverse sensor configurations available within a reasonable computational budget, attention was focused on incorporating more modalities and advanced architectures (e.g., Transformers [27]) to capture global context and representative features, as in TransFuser [6], [28] and many variants [29], [30], [31]. Combined with more insights about the simulation environment, these advanced designs resulted in a substantial performance boost on the CARLA benchmark [13]. To improve the interpretability and safety of autonomous systems, approaches [11], [32], [33] explicitly involve various auxiliary modules to better supervise the learning process or utilize attention visualization. Recent works prioritize generating safety-critical data [7], [34], [35], pre-training a foundation model or backbone curated for policy learning [12], [36], [37], and advocating a modular end-to-end planning philosophy [1], [2], [38], [39]. Meanwhile, the new and challenging CARLA v2 [13] and nuPlan [14] benchmarks have been introduced to facilitate research into this area.

C. Comparison to Related Surveys

We would like to clarify the difference between our survey and previous related surveys [40], [41], [42], [43], [44], [45], [46], [47], [48]. Some prior surveys [40], [41], [42], [43] cover content similar to ours in the sense of an end-to-end system. However, they do not cover new benchmarks and approaches that arose with the significant recent transition in the field, and place a minor emphasis on frontiers and challenges. The others focus on specific topics in this domain, such as imitation learning [44], [45], [46] or reinforcement learning [47], [48]. In contrast, our survey provides up-to-date information on the latest developments in this field, covering a wide span of topics and providing in-depth discussions of critical challenges.

D. Contributions

To summarize, this survey has three key contributions:
a) We provide a comprehensive analysis of end-to-end autonomous driving for the first time, including high-level motivation, methodologies, benchmarks, and more. Instead of optimizing a single block, we advocate for a philosophy to design the algorithm framework as a whole, with the ultimate target of achieving safe and comfortable driving.
b) We extensively investigate the critical challenges that concurrent approaches face. Out of the more than 270 papers surveyed, we summarize major aspects and provide in-depth analysis, including topics on generalizability, language-guided learning, causal confusion, etc.
c) We cover the broader impact of how to embrace large foundation models and data engines. We believe that this line of research and the large scale of high-quality data it provides could significantly advance this field. To facilitate future research, we maintain an active repository updated with new literature and open-source projects.

II. METHODS

This section reviews fundamental principles behind most existing end-to-end self-driving approaches. Section II-A discusses methods using imitation learning and provides details on the two most popular sub-categories, namely behavior cloning and inverse optimal control. Section II-B summarizes methods that follow the reinforcement learning paradigm.

A. Imitation Learning

Imitation learning (IL), also referred to as learning from demonstrations, trains an agent to learn the policy by imitating the behavior of an expert. IL requires a dataset D = {ξi} containing trajectories collected under the expert's policy πβ, where each trajectory is a sequence of state-action pairs. The goal of IL is to learn an agent policy π that matches πβ.

The policy π can output planned trajectories or control signals. Early works usually adopt control outputs, due to the ease of collection. However, predicting controls at different steps could lead to discontinuous maneuvers and the network inherently specializes to the vehicle dynamics, which hinders generalization
to other vehicles. Another genre of works predicts waypoints. It considers a relatively longer time horizon. Meanwhile, converting trajectories for vehicles to track into control signals needs additional controllers, which is non-trivial and involves vehicle models and control algorithms. Since no clear performance gap has been observed between these two paradigms, we do not differentiate them explicitly in this survey. An interesting and more in-depth discussion can be found in [22].

One widely used category of IL is behavior cloning (BC) [49], which reduces the problem to supervised learning. Inverse Optimal Control (IOC), also known as Inverse Reinforcement Learning (IRL) [50], is another type of IL method that utilizes expert demonstrations to learn a reward function. We elaborate on these two categories below.

1) Behavior Cloning: In BC, matching the agent's policy with the expert's is accomplished by minimizing the planning loss as supervised learning over the collected dataset: arg minθ E_(s,a)∼D [ℓ(πθ(s), a)]. Here, ℓ(πθ(s), a) represents a loss function that measures the distance between the agent action and the expert action.

Early applications of BC for driving [3], [8], [51] utilized an end-to-end neural network to generate control signals from camera inputs. Further enhancements, such as multi-sensor inputs [6], [52], auxiliary tasks [16], [28], and improved expert design [21], have been proposed to enable BC-based end-to-end driving models to handle challenging urban scenarios.

BC is advantageous due to its simplicity and efficiency, as it does not require hand-crafted reward design, which is crucial for RL. However, there are some common issues. During training, it treats each state as independently and identically distributed, resulting in an important problem known as covariate shift. For general IL, several on-policy methods have been proposed to address this issue [26], [53], [54], [55]. In the context of end-to-end autonomous driving, DAgger [26] has been adopted in [5], [10], [25], [56]. Another common problem with BC is causal confusion, where the imitator exploits and relies on false correlations between certain input components and output signals. This issue has been discussed in the context of end-to-end autonomous driving in [57], [58], [59], [60]. These two challenging problems are further discussed in Section IV-I and Section IV-H, respectively.

2) Inverse Optimal Control: Traditional IOC algorithms learn an unknown reward function R(s, a) from expert demonstrations, where the expert's reward function can be represented as a linear combination of features [50], [61], [62], [63], [64]. However, in continuous, high-dimensional autonomous driving scenarios, the definition of the reward is implicit and difficult to optimize.

Generative adversarial imitation learning [65], [66], [67] is a specialized approach in IOC that designs the reward function as an adversarial objective to distinguish the expert and learned policies, similar to the concept of generative adversarial networks [68]. Recently, several works propose optimizing a cost volume or cost function with auxiliary perceptual tasks. Since a cost is an alternative representation of the reward, we classify these methods as belonging to the IOC domain. We define the cost learning framework as follows: end-to-end approaches learn a reasonable cost c(·) and use algorithmic trajectory samplers to select the trajectory τ∗ with the minimum cost, as illustrated in Fig. 3.

Fig. 3. Overview of methods in end-to-end autonomous driving. We illustrate three popular paradigms, including two imitation learning frameworks (behavior cloning and inverse optimal control), as well as online reinforcement learning.

Regarding cost design, it has representations including a learned cost volume in a bird's-eye-view (BEV) [32], joint energy calculated from other agents' future motion [69], or a set of probabilistic semantic occupancy or freespace layers [39], [70], [71]. On the other hand, trajectories are typically sampled from a fixed expert trajectory set [1] or processed by parameter sampling with a kinematic model [32], [38], [39], [70]. Then, a max-margin loss is adopted as in classic IOC methods to encourage the expert demonstration to have a minimal cost while others have high costs.

Several challenges exist with cost learning approaches. In particular, in order to generate more realistic costs, HD maps, auxiliary perception tasks, and multiple sensors are typically incorporated, which increases the difficulty of learning and constructing datasets for multi-modal multi-task frameworks. Nevertheless, the aforementioned cost learning methods significantly enhance the safety and interpretability of decisions (see Section IV-F), and we believe that the industry-inspired end-to-end system design is a viable approach for real-world applications.

B. Reinforcement Learning

Reinforcement learning (RL) [72], [73] is a field of learning by trial and error. The success of deep Q networks (DQN) [74] in achieving human-level control on the Atari benchmark [75] has popularized deep RL. DQN trains a neural network called the critic (or Q network), which takes as input the current state and an action, and predicts the discounted return of that action. The policy is then implicitly defined by selecting the action with the highest predicted return.

RL requires an environment that allows potentially unsafe actions to be executed, to collect novel data (e.g., via random actions). Additionally, RL requires significantly more data to train than IL. For this reason, modern RL methods often parallelize data collection across multiple environments [76]. Meeting these requirements in the real world presents great challenges. Therefore, almost all papers that use RL in driving have only investigated the technique in simulation. Most use different extensions of DQN. The community has not yet converged on a specific RL algorithm.

RL has successfully learned lane following on a real car on an empty street [4]. Despite this encouraging result, it must be noted that a similar task was already accomplished by IL three decades prior [3]. To date, no report has shown results for end-to-end training with RL that are competitive with IL. The reason for this failure is likely that the gradients obtained via RL are insufficient to train deep perception architectures (e.g., ResNet) required for driving. Models used in benchmarks like Atari, where RL succeeds, are relatively shallow, consisting of only a few layers [77].

RL has been successfully applied in end-to-end driving when combined with supervised learning (SL). Implicit affordances [18], [19] pre-train the CNN encoder using SL with tasks like semantic segmentation. In the second stage, this encoder is frozen, and a shallow policy head is trained on the features from the frozen encoder with a modern version of Q-learning [78]. RL can also be used to finetune full networks that were pre-trained using IL [17], [79].

RL can also be effectively applied if the network has access to privileged simulator information [48], [80], [81]. Privileged RL agents can be used for dataset curation. Roach [21] trains an RL agent on privileged BEV semantic maps and uses the policy to automatically collect a dataset with which a downstream IL agent is trained. WoR [20] employs a Q-function and tabular dynamic programming to generate additional or improved labels for a static dataset.

A challenge in the field is to transfer the findings from simulation to the real world. In RL, the objective is expressed as reward functions, and many algorithms require them to be dense and provide feedback at each environment step. Current works typically use simple objectives, such as progress and collision avoidance. These simplistic designs potentially encourage risky behaviors [80]. Devising or learning better reward functions remains an open problem. Another direction would be to develop RL algorithms that can handle sparse rewards, enabling the optimization of relevant metrics directly. RL can be effectively combined with world models [82], [83], [84], though this presents specific challenges (see Section IV-C). Current RL solutions for driving rely heavily on low-dimensional representations of the scene, and this issue is further discussed in Section IV-B-2.

III. BENCHMARKING

Autonomous driving systems require a comprehensive evaluation to ensure safety. Researchers must benchmark these systems using appropriate datasets, simulators, metrics, and hardware to accomplish this. This section delineates three approaches for benchmarking end-to-end autonomous driving systems: (1) real-world evaluation, (2) online or closed-loop evaluation in simulation, and (3) offline or open-loop evaluation on driving datasets. We focus on the scalable and principled online simulation setting and summarize real-world and offline assessments for completeness.

A. Real-World Evaluation

Early efforts on benchmarking self-driving involved real-world evaluation. Notably, DARPA initiated a series of races to advance autonomous driving. The first event offered $1M in prize money for autonomously navigating a 240 km route through the Mojave desert, which no team achieved [85]. The final series event, called the DARPA Urban Challenge, required vehicles to navigate a 96 km mock-up town course, adhering to traffic laws and avoiding obstacles [86]. These races fostered important developments in autonomous driving, such as LiDAR sensors. Following this spirit, the University of Michigan established MCity [87], a large controlled real-world environment designed to facilitate testing autonomous vehicles. However, such academic ventures have not been widely employed for end-to-end systems due to a lack of data and vehicles. In contrast, industries with the resources to deploy fleets of driverless vehicles could rely on real-world evaluation to benchmark improvements in their algorithms.

B. Online/Closed-Loop Simulation

Conducting tests of self-driving systems in the real world is costly and risky. To address this challenge, simulation is a viable alternative [14], [88], [89], [90], [91], [92]. Simulators facilitate rapid prototyping and testing, enable the quick iteration of ideas, and provide low-cost access to diverse scenarios for unit testing. In addition, simulators offer tools for measuring performance accurately. However, their primary disadvantage is that the results obtained in a simulated environment do not necessarily generalize to the real world (Section IV-I-3).

Closed-loop evaluation involves building a simulated environment that closely mimics a real-world driving environment. The evaluation entails deploying the driving system in simulation and measuring its performance. The system has to navigate safely through traffic while progressing toward a designated goal location. There are four main sub-tasks involved in developing such simulators: parameter initialization, traffic simulation, sensor simulation, and vehicle dynamics simulation. We briefly describe these sub-tasks below, followed by a summary of currently available open-source simulators for closed-loop benchmarks.

1) Parameter Initialization: Simulation offers the benefit of a high degree of control over the environment, including weather, maps, 3D assets, and low-level attributes such as the arrangement of objects in a traffic scene. While powerful, the number of these parameters is substantial, resulting in a challenging design problem. Current simulators tackle this in two ways:

Procedural Generation: Traditionally, initial parameters are hand-tuned by 3D artists and engineers [88], [89], [90], [91]. This limits scalability. Recently, some of the simulation
sensory layout and fusing them to complement each other for autonomous driving.

Multi-sensor fusion has predominantly been discussed in perception-related fields, e.g., object detection [131], [132] and semantic segmentation [133], [134], and is typically categorized into three groups: early, mid, and late fusion. End-to-end autonomous driving algorithms explore similar fusion schemes. Early fusion combines sensory inputs before feeding them into shared feature extractors, where concatenation is a common way for fusion [32], [135], [136], [137], [138]. To resolve the view discrepancy, some works project point clouds on images [139] or vice versa (predicting semantic labels for LiDAR points [52], [140]). On the other hand, late fusion combines multiple results from multi-modalities. It is less discussed due to its inferior performance [6], [141]. Contrary to these methods, middle fusion achieves multi-sensor fusion within the network by separately encoding inputs and then fusing them at the feature level. Naive concatenation is also frequently adopted [15], [22], [30], [142], [143], [144], [145], [146]. Recently, works have employed Transformers [27] to model interactions among features [6], [28], [29], [147], [148]. The attention mechanism in Transformers has demonstrated great effectiveness in aggregating the context of different sensor inputs and achieving safer end-to-end driving.

Inspired by the progress in perception, it is beneficial to model modalities in a unified space such as BEV [131], [132]. End-to-end driving also requires identifying policy-related contexts and discarding irrelevant details. We discuss perception-based representations in Section IV-B-1. Besides, the self-attention layer, interconnecting all tokens freely, incurs a significant computational cost and cannot guarantee useful information extraction. Advanced Transformer-based fusion mechanisms in the perception field, such as [149], [150], hold promise for application to the end-to-end driving task.

2) Language as Input: Humans drive using both visual perception and intrinsic knowledge, which together form causal behaviors. In areas related to autonomous driving such as embodied AI, incorporating natural language as fine-grained knowledge and instructions to control the visuomotor agent has achieved notable progress [151], [152], [153], [154]. However, compared to robotic applications, the driving task is more straightforward without the need for task decomposition, and the outdoor environment is much more complex with highly dynamic agents but few distinctive anchors for grounding.

To incorporate linguistic knowledge into driving, a few datasets are proposed to benchmark outdoor grounding and visual language navigation tasks [155], [156], [157], [158]. HAD [159] takes human-to-vehicle advice and adds a visual grounding task. Sriram et al. [160] translate natural language instructions into high-level behaviors, while [161], [162] directly ground the texts. CLIP-MC [163] and LM-Nav [164] utilize CLIP [165] to extract both linguistic knowledge from instructions and visual features from images.

Recently, observing the rapid development of large language models (LLMs) [166], [167], works encode the perceived scene into tokens and prompt them to LLMs for control prediction and text-based explanations [168], [169], [170]. Researchers also formulate the driving task as a question-answering problem and construct corresponding benchmarks [171], [172]. They highlight that LLMs offer opportunities to handle sophisticated instructions and generalize to different data domains, which shares similar advantages with applications in robotic areas [173]. However, LLMs for on-road driving could be challenging at present, considering their long inference time, low quantitative accuracy, and instability of outputs. Potential resolutions could be employing LLMs on the cloud specifically for complex scenarios and using them solely for high-level behavior prediction.

B. Dependence on Visual Abstraction

End-to-end autonomous driving systems roughly have two stages: encoding the state into a latent feature representation, and then decoding the driving policy with intermediate features. In urban driving, the input state, i.e., the surrounding environment and ego state, is much more diverse and high-dimensional compared to common policy learning benchmarks such as video games [18], [174], which might lead to a misalignment between representations and the necessary attention areas for policy making. Hence, it is helpful to design "good" intermediate perception representations, or to first pre-train visual encoders using proxy tasks. This enables the network to extract useful information for driving effectively, thus facilitating the subsequent policy stage. Furthermore, this can improve the sample efficiency for RL methods.

1) Representation Design: Naive representations are extracted with various backbones. Classic convolutional neural networks (CNNs) still dominate, with advantages in translation equivariance and high efficiency [175]. Depth-pre-trained CNNs [176] significantly boost perception and downstream performance. In contrast, Transformer-based feature extractors [177], [178] show great scalability in perception tasks while not being widely adopted for end-to-end driving yet. For driving-specific representations, researchers introduce the concept of bird's-eye-view (BEV), fusing different sensor modalities and temporal information within a unified 3D space [131], [132], [179], [180]. It also facilitates easy adaptations to downstream tasks [2], [30], [181]. In addition, grid-based 3D occupancy is developed to capture irregular objects and used for collision avoidance in planning [182]. Nevertheless, the dense representation brings huge computation costs compared to BEV methods.

Another unsettled problem is the representation of the map. Traditional autonomous driving relies on HD Maps. Due to the high cost and limited availability of HD Maps, online mapping methods have been devised with different formulations, such as BEV segmentation [183], vectorized lanelines [184], centerlines and their topology [185], [186], and lane segments [187]. However, the most suitable formulation for end-to-end systems remains unvalidated.

Though various representation designs offer possibilities of how to design the subsequent decision-making process, they also place challenges, as co-designing both parts is necessary for a whole framework. Besides, given the trends observed in several simple yet effective approaches with scaling up training
resources [22], [28], the ultimate necessity of explicit representations such as maps is uncertain.

2) Representation Learning: Representation learning often incorporates certain inductive biases or prior information. There inevitably exist possible information bottlenecks in the learned representation, and redundant context unrelated to decisions may be removed.

Some early methods directly utilize semantic segmentation masks from off-the-shelf networks as the input representation for subsequent policy training [188], [189]. SESR [190] further encodes segmentation masks into class-disentangled representations through a VAE [191]. In [192], [193], predicted affordance indicators, such as traffic light states, offset to the lane center, and distance to the leading vehicle, are used as representations for policy learning.

Observing that results like segmentation as representations can create bottlenecks defined by humans and result in loss of useful information, some have chosen intermediate features from pre-training tasks as effective representations for RL training [18], [19], [194], [195]. In [196], latent features in a VAE are augmented by attention maps obtained from the diffused boundary of segmentation and depth maps to highlight important regions. TARP [197] utilizes data from a series of previous tasks to perform different task-related prediction tasks to acquire useful representations. In [198], the latent representation is learned by approximating the π-bisimulation metric, which is comprised of differences of rewards and outputs from the dynamics model. ACO [36] learns discriminative features by adding steering angle categorization into the contrastive learning structure. Recently, PPGeo [12] proposes to learn effective representations through motion prediction together with depth estimation in a self-supervised way on uncalibrated driving videos. ViDAR [199] utilizes raw image-point cloud pairs and pre-trains the visual encoder with a point cloud forecasting pre-task. These works demonstrate that self-supervised representation learning from large-scale unlabeled data for policy learning is promising and worthy of future exploration.

C. Complexity of World Modeling for Model-Based RL

Besides the ability to better abstract perceptual representations, it is essential for end-to-end models to make reasonable predictions about the future to take safe maneuvers. In this section, we mainly discuss the challenges of current model-based policy learning works, where a world model provides explicit future predictions for the policy model.

Deep RL typically suffers from high sample complexity, which is pronounced in autonomous driving. Model-based reinforcement learning (MBRL) offers a promising direction to improve sample efficiency by allowing agents to interact with the learned world model instead of the actual environment. MBRL methods employ an explicit world (environment) model, which is composed of transition dynamics and reward functions. This is particularly helpful in driving, as simulators like CARLA are relatively slow.

However, modeling the highly dynamic environment is a challenging task. To simplify the problem, Chen et al. [20] factor the transition dynamics into a non-reactive world model and a simple kinematic bicycle model. In [137], a probabilistic sequential latent model is used as the world model. To address the potential inaccuracy of the learned world model, Henaff et al. [200] train the policy network with dropout regularization to estimate the uncertainty cost. Another approach [201] uses an ensemble of multiple world models to provide uncertainty estimation, based on which imaginary rollouts could be truncated and adjusted accordingly. Motivated by Dreamer [82], ISO-Dream [202] decouples visual dynamics into controllable and uncontrollable states, and trains the policy on the disentangled states.

It is worth noting that learning world models in raw image space is non-trivial for autonomous driving. Important small details, such as traffic lights, would easily be missed in predicted images. To tackle this, GenAD [203] and DriveWM [204] employ the prevailing diffusion technique [205]. MILE [206] incorporates Dreamer-style world model learning in the BEV segmentation space as an auxiliary task besides imitation learning. SEM2 [136] also extends the Dreamer structure but with BEV map inputs, and uses RL for training. Besides directly using the learned world model for MBRL, DeRL [195] combines a model-free actor-critic framework with the world model, by fusing self-assessments of the action or state from both models.

World model learning for end-to-end autonomous driving is an emerging and promising direction as it greatly reduces the sample complexity for RL, and understanding the world is helpful for driving. However, as the driving environment is highly complex and dynamic, further study is still needed to determine what needs to be modeled and how to model the world effectively.

D. Reliance on Multi-Task Learning

Multi-task learning (MTL) involves jointly performing several related tasks based on a shared representation through separate heads. MTL provides advantages such as computational cost reduction, the sharing of relevant domain knowledge, and the ability to exploit task relationships to improve the model's generalization ability [207]. Consequently, MTL is well-suited for end-to-end driving, where the ultimate policy prediction requires a comprehensive understanding of the environment. However, the optimal combination of auxiliary tasks and the appropriate weighting of losses to achieve the best performance presents a significant challenge.

In contrast to common vision tasks where dense predictions are closely correlated, end-to-end driving predicts a sparse signal. The sparse supervision increases the difficulty of extracting useful information for decision-making in the encoder. For image input, auxiliary tasks such as semantic segmentation [28], [31], [139], [208], [209], [210] and depth estimation [28], [31], [208], [209], [210] are commonly adopted in end-to-end autonomous driving models. Semantic segmentation helps the model gain a high-level understanding of the scene; depth estimation enables the model to capture the 3D geometry of the environment and better estimate distances to critical objects. Besides auxiliary tasks on perspective images, 3D object detection [28], [31], [52] is also useful for LiDAR encoders. As BEV becomes
Fig. 7. Causal Confusion. The current action of a car is strongly correlated with low-dimensional spurious features such as the velocity or the car's past trajectory. End-to-end models may latch on to them, leading to causal confusion.

confusion [242], where access to more information leads to worse performance.

Causal confusion in imitation learning has been a persistent challenge for nearly two decades. One of the earliest reports of this effect was made by LeCun et al. [243]. They used a single input frame for steering prediction to avoid such extrapolation. Though simplistic, this is still a preferred solution in current state-of-the-art IL methods [22], [28]. Unfortunately, using a single frame makes it hard to extract the motion of surrounding actors. Another source of causal confusion is speed measurement [16]. Fig. 7 showcases an example of a car waiting at a red light. The action of the car could highly correlate with its speed because it has waited for many frames where the speed is zero and the action is the brake. Only when the traffic light changes from red to green does this correlation break down.

There are several approaches to combat the causal confusion problem when using multiple frames. In [57], the authors attempt to remove spurious temporal correlations from the bottleneck representation by training an adversarial model that predicts the ego agent's past action. Intuitively, the resulting min-max optimization trains the network to eliminate its past from intermediate layers. It works well in MuJoCo but does not scale to complex vision-based driving. OREO [59] maps images to discrete codes representing semantic objects and applies random dropout masks to units that share the same discrete code, which helps in confounded Atari. In end-to-end driving, ChauffeurNet [244] addresses the causal confusion issue by using the past ego-motion as an intermediate BEV abstraction and dropping it out with a 50% probability during training. Wen et al. [58] propose upweighting keyframes in the training loss, where a decision change occurs (and which hence are not predictable by extrapolating the past). PrimeNet [60] improves performance compared to keyframes by using an ensemble, where the prediction of a single-frame model is given as additional input to a multi-frame model. Chuang et al. [245] do the same but supervise the multi-frame network with action residuals instead of actions. In addition, the problem of causal confusion can be circumvented by using only LiDAR histories (with a single frame image) and realigning point clouds into one coordinate system. This removes ego-motion while retaining information about other

I. Lack of Robustness

1) Long-Tailed Distribution: One important aspect of the long-tailed distribution problem is dataset imbalance, where a few classes make up the majority, as shown in Fig. 8(a). This poses a big challenge for models to generalize to diverse environments. Various methods mitigate this issue with data processing, including over-sampling [246], [247], under-sampling [248], [249], and data augmentation [250], [251]. Besides, weighting-based approaches [252], [253] are also commonly used.

Fig. 8. Challenges in robustness. Three primary generalization issues arise in relation to dataset distribution discrepancies, namely long-tailed and normal cases, expert demonstration and test scenarios, and domain shift in locations, weather, etc.

In the context of end-to-end autonomous driving, the long-tailed distribution issue is particularly severe. Most drives are repetitive and uninteresting, e.g., following a lane for many frames. Conversely, interesting safety-critical scenarios occur rarely but are diverse in nature, and hard to replicate in the real world for safety reasons. To tackle this, some works rely on handcrafted scenarios [13], [100], [254], [255], [256] to generate more diverse data in simulation. LBC [5] leverages the privileged agent to create imaginary supervisions conditioned on different navigational commands. LAV [52] includes trajectories of non-ego agents for training to promote data diversity. In [257], a simulation framework is proposed to apply importance-sampling strategies to accelerate the evaluation of rare-event probabilities.

Another line of research [7], [34], [35], [258], [259], [260] generates safety-critical scenarios in a data-driven manner through adversarial attacks. In [258], Bayesian Optimization is employed to generate adversarial scenarios. Learning to collide [35] represents driving scenarios as the joint distribution over building blocks and applies policy gradient RL methods to generate risky scenarios. AdvSim [34] modifies agents' trajectories to cause failures, while still adhering to physical plausibility. KING [7] proposes an optimization algorithm for safety-critical perturbations using gradients through differentiable kinematics models.

In general, efficiently generating realistic safety-critical scenarios that cover the long-tailed distribution remains a significant challenge. While many works focus on adversarial scenarios in simulators, it is also essential to better utilize real-world data for critical scenario mining and potential adaptation to simulation. Besides, a systematic, rigorous, comprehensive, and realistic testing framework is crucial for evaluating end-to-end autonomous driving methods under these long-tailed, safety-critical scenarios.

2) Covariate Shift: As discussed in Section II-A, one important challenge for behavior cloning is covariate shift. The state distributions from the expert's policy and those from the trained agent's policy differ, leading to compounding errors when the trained agent is deployed in unseen testing environments or when the reactions from other agents differ from training time. This
could result in the trained agent being in a state that is outside the expert's distribution for training, leading to severe failures. An illustration is presented in Fig. 8(b).

DAgger (Dataset Aggregation) [26] is a common solution for this issue. DAgger is an iterative training process. The current trained policy is rolled out in each iteration to collect new data, and the expert is used to label the visited states. This enriches the dataset by adding examples of how to recover from suboptimal states that an imperfect policy might visit. The policy is then trained on the augmented dataset, and the process repeats. However, one downside of DAgger is the need for an available expert to query online.

For end-to-end autonomous driving, DAgger is adopted in [24] with an MPC-based expert. To reduce the cost of constantly querying the expert, SafeDAgger [25] extends the original DAgger algorithm by learning a safety policy that estimates the deviation between the current policy and the expert policy. The expert is only queried when the deviation is large. MetaDAgger [56] uses meta-learning with DAgger to aggregate data from multiple environments. LBC [5] adopts DAgger and resamples the data with higher loss more frequently. To better utilize failure or safety-related samples, DARB [10] proposes several mechanisms, including task-based, policy-based, and policy & expert-based mechanisms, to sample such critical states.

3) Domain Adaptation: Domain adaptation (DA) is a type of transfer learning in which the target task is the same as the source task, but the domains differ. Here we discuss scenarios where labels are available for the source domain while no labels or only a limited amount of labels are available for the target domain.

As shown in Fig. 8(c), domain adaptation for autonomous driving tasks encompasses several cases [261]:
- Sim-to-real: the large gap between simulators used for training and the real world used for deployment.
- Geography-to-geography: different geographic locations with varying environmental appearances.
- Weather-to-weather: changes in sensor inputs caused by weather conditions such as rain, fog, and snow.
- Day-to-night: illumination variations in visual inputs.
- Sensor-to-sensor: possible differences in sensor characteristics, e.g., resolution and relative position.
Note that the aforementioned cases often overlap.

Typically, domain-invariant feature learning is achieved with image translators and discriminators to map images from two domains into a common latent space or representations like segmentation maps [262], [263]. LUSR [264] and UAIL [235] adopt a Cycle-Consistent VAE and a GAN, respectively, to project images into a latent representation comprised of a domain-specific part and a domain-general part. In SESR [190], class-disentangled encodings are extracted from a semantic segmentation mask to reduce the sim-to-real gap. Domain randomization [265], [266], [267] is also a simple and effective sim-to-real technique for RL policy learning, which is further adapted for end-to-end autonomous driving [188], [268]. It is realized by randomizing the rendering and physical settings of the simulators to cover the variability of the real world during training.

Currently, sim-to-real adaptation through source-target image mapping or domain-invariant feature learning is the focus. Other DA cases are handled by constructing a diverse and large-scale dataset. Given that current methods mainly concentrate on the visual gap in images, and that LiDAR has become a popular input modality for driving, specific adaptation techniques tailored for LiDAR must also be designed. Besides, traffic agents' behavior gaps between the simulator and the real world should be noticed as well. Incorporating real-world data into simulation through techniques such as NeRF [113] is another promising direction.

V. FUTURE TRENDS

Considering the challenges and opportunities discussed, we list some crucial directions for future research that may have a broader impact in this field.

A. Zero-Shot and Few-Shot Learning

It is inevitable for autonomous driving models to eventually encounter real-world scenarios that lie beyond the training data distribution. This raises the question of whether we can successfully adapt the model to an unseen target domain where limited or no labeled data is available. Formalizing this task for the end-to-end driving domain and incorporating techniques from the zero-shot/few-shot learning literature are the key steps toward achieving this [269], [270].

B. Modular End-to-End Planning

The modular end-to-end planning framework optimizes multiple modules while prioritizing the ultimate planning task, which enjoys the advantages of interpretability as indicated in
Section IV-F. This is advocated in recent literature [2], [271], and certain industry solutions (Tesla, Wayve, etc.) have involved similar ideas. When designing these differentiable perception modules, several questions arise regarding the choice of loss functions, such as the necessity of 3D bounding boxes for object detection, whether opting for BEV segmentation over lane topology for static scene perception, or the training strategies with limited modules' data.

C. Data Engine

The importance of large-scale and high-quality data for autonomous driving can never be emphasized enough [272]. Establishing a data engine with an automatic labeling pipeline [273] could greatly facilitate the iterative development of both data and models. The data engine for autonomous driving, especially modular end-to-end planning systems, needs to streamline the process of annotating high-quality perception labels with the aid of large perception models in an automatic way. It should also support mining hard/corner cases, scene generation, and editing to facilitate the data-driven evaluations discussed in Section III-B and promote diversity of data and the generalization ability of models (Section IV-I). A data engine would enable autonomous driving models to make consistent improvements.

D. Foundation Model

Recent advancements in foundation models in both language [166], [167], [274] and vision [273], [275], [276] have proved that large-scale data and model capacity can unleash the immense potential of AI in high-level reasoning tasks. The paradigm of finetuning [277] or prompt learning [278], optimization in the form of self-supervised reconstruction [279] or contrastive pairs [165], etc., are all applicable to the end-to-end driving domain. However, we contend that the direct adoption of LLMs for driving might be tricky. The output of an autonomous agent requires steady and accurate measurements, whereas the generative output in language models aims to behave like humans, irrespective of its accuracy. A feasible solution to develop a "foundation" driving model is to train a world model that can forecast the reasonable future of the environment, either in 2D, 3D, or latent space. To perform well on downstream tasks like planning, the objective to be optimized for the model needs to be sophisticated enough, beyond frame-level perception.

VI. CONCLUSION AND OUTLOOK

In this survey, we provide an overview of fundamental methodologies and summarize various aspects of simulation and benchmarking. We thoroughly analyze the extensive literature to date, and highlight a wide range of critical challenges and promising resolutions.

Outlook: The industry has dedicated considerable effort over the years to develop advanced modular-based systems capable of achieving autonomous driving on highways. However, these systems face significant challenges when confronted with complex scenarios, e.g., inner-city streets and intersections. Therefore, an increasing number of companies have started exploring end-to-end autonomous driving techniques specifically tailored for these environments. It is envisioned that with extensive high-quality data collection, large-scale model training, and the establishment of reliable benchmarks, the end-to-end approach will have enormous potential over modular stacks in terms of performance and effectiveness. In summary, end-to-end autonomous driving faces great opportunities and challenges simultaneously, with the ultimate goal of building generalist agents. In this era of emerging technologies, we hope this survey could serve as a starting point to shed new light on this domain.

ACKNOWLEDGMENT

The authors would like to thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Bernhard Jaeger and Kashyap Chitta.

REFERENCES

[1] S. Casas, A. Sadat, and R. Urtasun, "MP3: A unified model to map, perceive, predict and plan," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14403–14412.
[2] Y. Hu et al., "Planning-oriented autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 17853–17862.
[3] D. A. Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," in Proc. Int. Conf. Neural Inf. Process. Syst., 1988, pp. 305–313.
[4] A. Kendall et al., "Learning to drive in a day," in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 8248–8254.
[5] D. Chen, B. Zhou, V. Koltun, and P. Krähenbühl, "Learning by cheating," in Proc. Conf. Robot Learn., 2020, pp. 66–75.
[6] A. Prakash, K. Chitta, and A. Geiger, "Multi-modal fusion transformer for end-to-end autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 7077–7087.
[7] N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya, and A. Geiger, "KING: Generating safety-critical driving scenarios for robust imitation via kinematics gradients," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 335–352.
[8] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316.
[9] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 4693–4700.
[10] A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta, and A. Geiger, "Exploring data aggregation in policy learning for vision-based urban autonomous driving," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11763–11773.
[11] K. Chitta, A. Prakash, and A. Geiger, "NEAT: Neural attention fields for end-to-end autonomous driving," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 15793–15803.
[12] P. Wu, L. Chen, H. Li, X. Jia, J. Yan, and Y. Qiao, "Policy pre-training for autonomous driving via self-supervised geometric modeling," in Proc. Int. Conf. Learn. Representations, 2023.
[13] CARLA, "CARLA autonomous driving leaderboard," 2022. [Online]. Available: https://leaderboard.carla.org/
[14] H. Caesar et al., "nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2021.
[15] J. Hawke et al., "Urban driving with conditional imitation learning," in Proc. IEEE Int. Conf. Robot. Automat., 2020.
[16] F. Codevilla, E. Santana, A. M. López, and A. Gaidon, "Exploring the limitations of behavior cloning for autonomous driving," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9329–9338.
[17] X. Liang, T. Wang, L. Yang, and E. Xing, "CIRL: Controllable imitative reinforcement learning for vision-based self-driving," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 584–599.
[18] M. Toromanoff, E. Wirbel, and F. Moutarde, "End-to-end model-free reinforcement learning for urban driving using implicit affordances," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 7153–7162.
[19] R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde, “GRI: Gen- [45] L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis, “A survey on
eral reinforced imitation and its application to vision-based autonomous imitation learning techniques for end-to-end autonomous vehicles,”
driving,” Robotics, vol. 12, 2023, Art. no. 217. IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 14128–14147,
[20] D. Chen, V. Koltun, and P. Krähenbühl, “Learning to drive from a world Sep. 2022.
on rails,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 15590–15599. [46] B. Zheng, S. Verma, J. Zhou, I. W. Tsang, and F. Chen, “Imitation
[21] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “End-to-end urban learning: Progress, taxonomies and challenges,” IEEE Trans. Neural
driving by imitating a reinforcement learning coach,” in Proc. IEEE Int. Netw. Learn. Syst., vol. 35, no. 5, pp. 6322–6337, May 2024.
Conf. Comput. Vis., 2021, pp. 15222–15232. [47] Z. Zhu and H. Zhao, “A survey of deep RL and IL for autonomous
[22] P. Wu, X. Jia, L. Chen, J. Yan, H. Li, and Y. Qiao, “Trajectory-guided driving policy learning,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9,
control prediction for end-to-end autonomous driving: A simple yet pp. 14043–14065, Sep. 2022.
strong baseline,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, [48] B. R. Kiran et al., “Deep reinforcement learning for autonomous driving:
pp. 6119–6132. A survey,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4909–4926,
[23] J. Zhang, Z. Huang, and E. Ohn-Bar, “Coaching a teachable student,” in Jun. 2022.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7805–7815. [49] M. Bain and C. Sammut, “A framework for behavioural cloning,” Mach.
[24] Y. Pan et al., “Agile autonomous driving using end-to-end deep imitation Intell., vol. 15, 1995.
learning,” in Proc. Robotics: Sci. Sys. Conf., 2017. [50] B. D. Ziebart et al., “Maximum entropy inverse reinforcement learning,”
[25] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end in Proc. AAAI Conf. Artif. Intell., 2008, pp. 1433–1438.
simulated driving,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 2891– [51] Y. Lecun, E. Cosatto, J. Ben, U. Muller, and B. Flepp, “DAVE: Au-
2897. tonomous off-road vehicle control using end-to-end learning,” Courant
[26] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning Institute/CBLL, Tech. Rep. DARPA-IPTO Final Report, 2004. [Online].
and structured prediction to no-regret online learning,” in Proc. Int. Conf. Available: http://www.cs.nyu.edu/~yann/research/dave/index.html
Artif. Intell. Statist., 2011, pp. 627–635. [52] D. Chen and P. Krähenbühl, “Learning from all vehicles,” in Proc. IEEE
[27] A. Vaswani et al., “Attention is all you need,” in Proc. Int. Conf. Neural Conf. Comput. Vis. Pattern Recognit., 2022, pp. 17222–17231.
Inf. Process. Syst., 2017, pp. 6000–6010. [53] K. Judah, A. P. Fern, T. G. Dietterich, and P. Tadepalli, “Active imitation
[28] K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger, “Trans- learning: Formal and practical reductions to IID learning,” J. Mach.
fuser: Imitation with transformer-based sensor fusion for autonomous Learn. Res., vol. 15, pp. 4105–4143, 2014.
driving,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, [54] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in
pp. 12878–12895, Nov. 2023. Proc. Int. Conf. Artif. Intell. Statist., 2010, pp. 661–668.
[29] H. Shao, L. Wang, R. Chen, H. Li, and Y. Liu, “Safety-enhanced au- [55] S. Ross and J. A. Bagnell, “Reinforcement and imitation learning via
tonomous driving using interpretable sensor fusion transformer,” in Proc. interactive no-regret learning,” 2014, arXiv:1406.5979.
Conf. Robot Learn., 2022, pp. 726–737. [56] A. E. Sallab, M. Saeed, O. A. Tawab, and M. Abdou, “Meta learning
[30] X. Jia et al., “Think twice before driving: Towards scalable decoders framework for automated driving,” 2017, 1706.04038.
for end-to-end autonomous driving,” in Proc. IEEE Conf. Comput. Vis. framework for automated driving,” 2017, arXiv:1706.04038.
Pattern Recognit., 2023, pp. 21983–21994. agents in behavioral cloning from observation histories,” in Proc. Int.
[31] B. Jaeger, K. Chitta, and A. Geiger, “Hidden biases of end-to- Conf. Neural Inf. Process. Syst., 2020, pp. 2564–2575.
end driving models,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, [58] C. Wen, J. Lin, J. Qian, Y. Gao, and D. Jayaraman, “Keyframe-focused
pp. 8240–8249. visual imitation learning,” in Proc. Int. Conf. Mach. Learn., 2021,
[32] W. Zeng et al., “End-to-end interpretable neural motion planner,” in Proc. pp. 11123–11133.
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8660–8669. [59] J. Park et al., “Object-aware regularization for addressing causal confu-
[33] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual sion in imitation learning,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
explanations for self-driving vehicles,” in Proc. Eur. Conf. Comput. Vis., 2021, pp. 3029–3042.
2018, pp. 563–578. [60] C. Wen, J. Qian, J. Lin, J. Teng, D. Jayaraman, and Y. Gao, “Fighting fire
[34] J. Wang et al., “Advsim: Generating safety-critical scenarios for self- with fire: Avoiding DNN shortcuts through priming,” in Proc. Int. Conf.
driving vehicles,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Mach. Learn., 2022, pp. 23723–23750.
2021, pp. 9909–9918. [61] D. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolating
[35] W. Ding, B. Chen, M. Xu, and D. Zhao, “Learning to collide: An adaptive beyond suboptimal demonstrations via inverse reinforcement learn-
safety-critical scenarios generating method,” in Proc. IEEE/RSJ Int. Conf. ing from observations,” in Proc. Int. Conf. Mach. Learn., 2019,
Intell. Robots Syst., 2020, pp. 2243–2250. pp. 783–792.
[36] Q. Zhang, Z. Peng, and B. Zhou, “Learning to drive by watching youtube [62] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse
videos: Action-conditioned contrastive policy pretraining,” in Proc. Eur. optimal control via policy optimization,” in Proc. Int. Conf. Mach. Learn.,
Conf. Comput. Vis., 2022, pp. 111–128. 2016, pp. 49–58.
[37] J. Zhang, R. Zhu, and E. Ohn-Bar, “SelfD: Self-learning large-scale [63] S. Reddy, A. D. Dragan, and S. Levine, “SQIL: Imitation learning via
driving policies from the web,” in Proc. IEEE Conf. Comput. Vis. Pattern reinforcement learning with sparse rewards,” 2019, arXiv:1905.11108.
Recognit., 2022, pp. 17316–17326. [64] S. Luo, H. Kasaei, and L. Schomaker, “Self-imitation learning by plan-
[38] S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “ST-P3: End-to-end ning,” in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 4823–4829.
vision-based autonomous driving via spatial-temporal feature learning,” [65] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Proc.
in Proc. Eur. Conf. Comput. Vis., 2022, pp. 533–549. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4572–4580.
[39] A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and R. Urtasun, “Perceive, [66] Y. Li, J. Song, and S. Ermon, “InfoGAIL: Interpretable imitation learning
predict, and plan: Safe motion planning through interpretable semantic from visual demonstrations,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
representations,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 414–430. 2017, pp. 3815–3825.
[40] J. Janai, F. Güney, A. Behl, and A. Geiger, “Computer vision for [67] G. Lee, D. Kim, W. Oh, K. Lee, and S. Oh, “MixGAIL: Autonomous
autonomous vehicles: Problems, datasets and state-of-the-art,” 2017, driving using demonstrations with mixed qualities,” in Proc. IEEE/RSJ
arXiv:1704.05519. Int. Conf. Intell. Robots Syst., 2020, pp. 5425–5430.
[41] A. Tampuu, T. Matiisen, M. Semikin, D. Fishman, and N. Muhammad, [68] I. Goodfellow et al., “Generative adversarial networks,” Commun. ACM,
“A survey of end-to-end driving: Architectures and training methods,” vol. 63, pp. 139–144, 2020.
IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 4, pp. 1364–1384, [69] H. Wang, P. Cai, R. Fan, Y. Sun, and M. Liu, “End-to-end interactive
Apr. 2022. prediction and planning with optical flow distillation for autonomous
[42] S. Teng et al., “Motion planning for autonomous driving: The state of driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops,
the art and future perspectives,” IEEE Trans. Intell. Veh., vol. 8, no. 6, 2021, pp. 2229–2238.
pp. 3692–3711, Jun. 2023. [70] P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan, “Safe local motion
[43] D. Coelho and M. Oliveira, “A review of end-to-end autonomous driving planning with self-supervised freespace forecasting,” in Proc. IEEE Conf.
in urban environments,” IEEE Access, vol. 10, pp. 75296–75311, 2022. Comput. Vis. Pattern Recognit., 2021, pp. 12727–12736.
[44] A. O. Ly and M. Akhloufi, “Learning to drive by imitation: An overview [71] T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan,
of deep behavior cloning methods,” IEEE Trans. Intell. Veh., vol. 6, no. 2, “Differentiable raycasting for self-supervised occupancy forecasting,”
pp. 195–209, Jun. 2021. in Proc. Eur. Conf. Comput. Vis., 2022, pp. 353–369.
[72] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” [99] K. Chitta, D. Dauner, and A. Geiger, “SLEDGE: Synthesizing sim-
IEEE Trans. Neural Netw. Learn. Syst., vol. 9, no. 5, pp. 1054–1054, ulation environments for driving agents with generative models,”
Sep. 1998. 2024, arXiv:2403.17933.
[73] B. Jaeger and A. Geiger, “An invitation to deep reinforcement learning,” [100] S. Suo, S. Regalado, S. Casas, and R. Urtasun, “TrafficSim: Learning to
2023, arXiv:2312.08365. simulate realistic multi-agent behaviors,” in Proc. IEEE Conf. Comput.
[74] V. Mnih et al., “Human-level control through deep reinforcement learn- Vis. Pattern Recognit., 2021, pp. 10395–10404.
ing,” Nature, vol. 518, pp. 529–533, 2015. [101] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states
[75] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade in empirical observations and microscopic simulations,” Phys. Rev. E,
learning environment: An evaluation platform for general agents,” J. Artif. vol. 62, 2000, Art. no. 1805.
Intell. Res., vol. 47, pp. 253–279, 2013. [102] Z. Zhong et al., “Guided conditional diffusion for controllable traf-
[76] D. Horgan et al., “Distributed prioritized experience replay,” fic simulation,” in Proc. IEEE Int. Conf. Robot. Automat., 2023,
2018, arXiv:1803.00933. pp. 3560–3566.
[77] J. Bjorck, C. P. Gomes, and K. Q. Weinberger, “Towards deeper deep [103] D. Xu, Y. Chen, B. Ivanovic, and M. Pavone, “Bits: Bi-level imitation
reinforcement learning with spectral normalization,” in Proc. Int. Conf. for traffic simulation,” in Proc. IEEE Int. Conf. Robot. Automat., 2023,
Neural Inf. Process. Syst., 2021, pp. 8242–8255. pp. 2929–2936.
[78] M. Toromanoff, E. Wirbel, and F. Moutarde, “Is deep reinforce- [104] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “TrafficBots:
ment learning really superhuman on atari? Leveling the playing field,” Towards world models for autonomous driving simulation and mo-
2019, arXiv:1908.04683. tion prediction,” in Proc. IEEE Int. Conf. Robot. Automat., 2023,
[79] E. Ohn-Bar, A. Prakash, A. Behl, K. Chitta, and A. Geiger, “Learning [105] S. Manivasagam et al., “LiDARsim: Realistic LiDAR simulation by lever-
situational driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [105] S. Manivasagam et al., “LiDARsi: Realistic LiDAR simulation by lever-
2020, pp. 11293–11302. aging the real world,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[80] W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone, “Re- 2020, pp. 11167–11176.
ward (Mis)design for autonomous driving,” Artif. Intell., vol. 316, 2023, [106] Y. Chen et al., “Geosim: Realistic video simulation via geometry-aware
Art. no. 103829. composition for self-driving,” in Proc. IEEE Conf. Comput. Vis. Pattern
[81] C. Zhang et al., “Rethinking closed-loop training for autonomous driv- Recognit., 2021, pp. 7230–7240.
ing,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 264–282. [107] Z. Yang et al., “UniSim: A neural closed-loop sensor simulator,” in Proc.
[82] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 1389–1399.
Learning behaviors by latent imagination,” in Proc. Int. Conf. Learn. [108] A. Petrenko, E. Wijmans, B. Shacklett, and V. Koltun, “Megaverse:
Representations, 2020. Simulating embodied agents at one million experiences per second,” in
[83] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, “Mastering atari with Proc. Int. Conf. Mach. Learn., 2021, pp. 8556–8566.
discrete world models,” in Proc. Int. Conf. Learn. Representations, [109] Z. Song et al., “Synthetic datasets for autonomous driving: A survey,”
2021. IEEE Trans. Intell. Veh., vol. 9, no. 1, pp. 1847–1864, Jan. 2024.
[84] D. Ha and J. Schmidhuber, “Recurrent world models facilitate pol- [110] A. Amini et al., “Learning robust control policies for end-to-end au-
icy evolution,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, tonomous driving from data-driven simulation,” IEEE Robot. Automat.
pp. 2455–2467. Lett., vol. 5, no. 2, pp. 1143–1150, Apr. 2020.
[85] M. Buehler, K. Iagnemma, and S. Singh, in The 2005 DARPA Grand [111] A. Amini et al., “VISTA 2.0: An open, data-driven simulator for mul-
Challenge: The Great Robot Race. Berlin, Germany: Springer, 2007. timodal sensing and policy learning for autonomous vehicles,” in Proc.
[86] M. Buehler, K. Iagnemma, and S. Singh, The DARPA Urban Challenge: IEEE Int. Conf. Robot. Automat., 2022, pp. 2419–2426.
Autonomous Vehicles in City Traffic. Berlin, Germany: Springer, 2009. [112] T.-H. Wang, A. Amini, W. Schwarting, I. Gilitschenski, S. Karaman, and
[87] U. of Michigan, “Mcity,” 2015. [Online]. Available: https://mcity.umich. D. Rus, “Learning interactive driving policies via data-driven simulation,”
edu/ in Proc. IEEE Int. Conf. Robot. Automat., 2022, pp. 7745–7752.
[88] T. Team, “Torcs, the open racing car simulator.” 2000. [Online]. Avail- [113] B. Mildenhall, P.P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi,
able: https://sourceforge.net/projects/torcs/ and R. Ng, “NeRF: Representing scenes as neural radiance fields for view
[89] M. Martinez, C. Sitawarin, K. Finch, L. Meincke, A. Yablonski, and synthesis,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 405–421.
A. Kornhauser, “Beyond grand theft auto V for training, testing and [114] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D gaussian
enhancing deep learning in self driving cars,” 2017, arXiv:1712.01397. splatting for real-time radiance field rendering,” ACM Trans. Graph.,
[90] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: vol. 42, 2023, Art. no. 139.
An open urban driving simulator,” in Proc. Conf. Robot Learn., 2017, [115] M. Tancik et al., “Block-neRF: Scalable large scene neural view syn-
pp. 1–16. thesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
[91] D. Team, “Deepdrive: A simulator that allows anyone with a PC to push pp. 8238–8248.
the state-of-the-art in self-driving,” 2020. [Online]. Available: https:// [116] H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-NERF: Scalable
github.com/deepdrive/deepdrive construction of large-scale nerfs for virtual fly-throughs,” in Proc. IEEE
[92] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “Metadrive: Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12922–12931.
Composing diverse driving scenarios for generalizable reinforcement [117] A. Kundu et al., “Panoptic neural fields: A semantic object-aware neural
learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3, scene representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
pp. 3461–3475, Mar. 2023. nit., 2022, pp. 12871–12881.
[93] M. Hendrikx, S. Meijer, J. Van Der Velden, and A. Iosup, “Procedural [118] Y. Yang, Y. Yang, H. Guo, R. Xiong, Y. Wang, and Y. Liao, “Urbangiraffe:
content generation for games: A survey,” ACM Trans. Multimedia Com- Representing urban scenes as compositional generative neural feature
put. Commun. Appl., vol. 9, pp. 1–22, 2013. fields,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 9199–9210.
[94] D. J. Fremont, T. Dreossi, S. Ghosh, X. Yue, A. L. Sangiovanni- [119] S. R. Richter, H. A. Alhaija, and V. Koltun, “Enhancing photorealism
Vincentelli, and S. A. Seshia, “Scenic: A language for scenario specifica- enhancement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2,
tion and scene generation,” in Proc. 40th ACM SIGPLAN Conf. Program. pp. 1700–1715, Feb. 2023.
Lang. Des. Implementation, 2019, pp. 63–78. [120] A. Schoonwinkel, Design and Test of a Computer Stabilized Unicycle,
[95] F. Hauer, T. Schmidt, B. Holzmüller, and A. Pretschner, “Did we test Stanford, CA, USA: Stanford University, 1987. [Online]. Available:
all scenarios for automated and autonomous driving systems?,” in Proc. https://books.google.com/books?id=LA8lGwAACAAJ
IEEE Intell. Transp. Syst. Conf., 2019, pp. 2950–2955. [121] P. Polack, F. Altché, B. d’Andréa Novel, and A. de La Fortelle, “The
[96] S. Tan et al., “SceneGen: Learning to generate realistic traffic scenes,” in kinematic bicycle model: A consistent model for planning feasible tra-
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 892–901. jectories for autonomous vehicles?,” in Proc. IEEE Intell. Veh. Symp.,
[97] L. Bergamini et al., “SimNet: Learning reactive self-driving simulations 2017, pp. 812–818.
from real-world observations,” in Proc. IEEE Int. Conf. Robot. Automat., [122] R. Rajamani, Vehicle Dynamics and Control. Berlin, Germany: Springer,
2021, pp. 5119–5125. 2011.
[98] L. Feng, Q. Li, Z. Peng, S. Tan, and B. Zhou, “TrafficGen: Learning to [123] F. Codevilla, A. M. Lopez, V. Koltun, and A. Dosovitskiy, “On offline
generate diverse and realistic traffic scenarios,” in Proc. IEEE Int. Conf. evaluation of vision-based driving models,” in Proc. Eur. Conf. Comput.
Robot. Automat., 2023, pp. 3567–3575. Vis., 2018, pp. 236–251.
[124] N. Contributors, “NAVSIM: Data-driven non-reactive autonomous [148] H. Shao, L. Wang, R. Chen, S. L. Waslander, H. Li, and Y. Liu, “Reason-
vehicle simulation,” 2024. [Online]. Available: https://github.com/ Net: End-to-end driving with temporal and global reasoning,” in Proc.
autonomousvision/navsim IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 13723–13733.
[125] D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta, “Parting with [149] Y. Li et al., “DeepFusion: Lidar-camera deep fusion for multi-modal 3D
misconceptions about learning-based vehicle motion planning,” in Proc. object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Conf. Robot Learn., 2023, pp. 1268–1281. 2022, pp. 17161–17170.
[126] H. Caesar et al., “nuScenes: A multimodal dataset for autonomous [150] S. Borse et al., “X-align: Cross-modal cross-view alignment for bird’s-
driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, eye-view segmentation,” in Proc. IEEE Winter Conf. Appl. Comput. Vis.,
pp. 11618–11628. 2023, pp. 3287–3297.
[127] B. Wilson et al., “Argoverse 2: Next generation datasets for self-driving [151] P. Anderson et al., “Vision-and-language navigation: Interpreting
perception and forecasting,” in Proc. Int. Conf. Neural Inf. Process. Syst. visually-grounded navigation instructions in real environments,” in Proc.
Datasets Benchmarks, 2021. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3674–3683.
[128] P. Sun et al., “Scalability in perception for autonomous driving: Waymo [152] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways
open dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, for robotic manipulation,” in Proc. Conf. Robot Learn., 2022, pp. 894–
pp. 2446–2454. 906.
[129] J.-T. Zhai et al., “Rethinking the open-loop evaluation of end-to-end [153] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied AI:
autonomous driving in nuscenes,” 2023, arXiv:2305.10430. From simulators to research tasks,” IEEE Trans. Emerg. Topics Comput.
[130] Z. Li et al., “Is ego status all you need for open-loop end-to-end au- Intell., vol. 6, no. 2, pp. 230–244, Apr. 2022.
tonomous driving?,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [154] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “Chat-
2024, pp. 14864–14873. GPT for robotics: Design principles and model abilities,” 2023,
[131] T. Liang et al., “BEVFusion: A simple and robust liDAR-camera fu- arXiv:2306.17582.
sion framework,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, [155] T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M. F.
pp. 10421–10434. Moens, “Talk2car: Taking control of your self-driving car,” in Proc. Conf.
[132] Z. Liu et al., “Bevfusion: Multi-task multi-sensor fusion with unified Empirical Methods Natural Lang. Process., 2019.
bird’s-eye view representation,” in Proc. IEEE Int. Conf. Robot. Automat., [156] P. Mirowski et al., “Learning to navigate in cities without a map,” in Proc.
2023, pp. 2774–2781. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 2424–2435.
[133] R. Zhang, S. A. Candra, K. Vetter, and A. Zakhor, “Sensor fusion for [157] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi, “TOUCHDOWN:
semantic segmentation of urban scenes,” in Proc. IEEE Int. Conf. Robot. Natural language navigation and spatial reasoning in visual street envi-
Automat., 2015, pp. 1850–1857. ronments,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019,
[134] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez, pp. 12530–12539.
“Sensor fusion for joint 3D object detection and semantic segmentation,” [158] R. Schumann and S. Riezler, “Generating landmark navigation instruc-
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, tions from maps as a graph-to-text problem,” in Proc. Annu. Meeting
pp. 1230–1237. Assoc. Comput. Linguistics, 2021, pp. 489–502.
[135] B. Zhou, P. Krähenbühl, and V. Koltun, “Does computer vision matter [159] J. Kim, T. Misu, Y.-T. Chen, A. Tawari, and J. Canny, “Grounding human-
for action?,” Sci. Robot., vol. 4, 2019. to-vehicle advice for self-driving vehicles,” in Proc. IEEE Conf. Comput.
[136] Z. Gao et al., “Enhance sample efficiency and robustness of end-to- Vis. Pattern Recognit., 2019, pp. 10583–10591.
end urban autonomous driving via semantic masked world model,” [160] S. Narayanan, T. Maniar, J. Kalyanasundaram, V. Gandhi, B. Bhowmick,
2022, arXiv:2210.04017. and K. M. Krishna, “Talk to the vehicle: Language conditioned au-
[137] J. Chen, S. E. Li, and M. Tomizuka, “Interpretable end-to-end ur- tonomous navigation of self driving cars,” in Proc. IEEE/RSJ Int. Conf.
ban autonomous driving with latent deep reinforcement learning,” Intell. Robots Syst., 2019, pp. 5284–5290.
IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 5068–5078, [161] J. Kim, S. Moon, A. Rohrbach, T. Darrell, and J. Canny, “Advisable
Jun. 2022. learning for self-driving vehicles by internalizing observation-to-action
[138] P. Cai, S. Wang, H. Wang, and M. Liu, “Carl-lead: Lidar-based end-to- rules,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020,
end autonomous driving with contrastive deep reinforcement learning,” pp. 9658–9667.
2021, arXiv:2109.08473. [162] J. Roh, C. Paxton, A. Pronobis, A. Farhadi, and D. Fox, “Conditional
[139] Z. Huang, C. Lv, Y. Xing, and J. Wu, “Multi-modal sensor fusion-based driving from natural language instructions,” in Proc. Conf. Robot Learn.,
deep neural network for end-to-end autonomous driving with scene 2019, pp. 540–551.
understanding,” IEEE Sensors J., vol. 21, no. 10, pp. 11781–11790, [163] K. Jain, V. Chhangani, A. Tiwari, K. M. Krishna, and V. Gandhi, “Ground
May 2021. then navigate: Language-guided navigation in dynamic scenes,” in Proc.
[140] O. Natan and J. Miura, “Fully end-to-end autonomous driving with IEEE Int. Conf. Robot. Automat., 2023, pp. 4113–4120.
semantic depth cloud mapping and multi-agent,” IEEE Trans. Intell. Veh., [164] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic navi-
vol. 8, no. 1, pp. 557–571, Jun. 2022. gation with large pre-trained models of language, vision, and action,” in
[141] Y. Xiao, F. Codevilla, A. Gurram, O. Urfalioglu, and A. M. López, “Multi- Proc. Conf. Robot Learn., 2023, pp. 492–504.
modal end-to-end autonomous driving,” IEEE Trans. Intell. Transp. Syst., [165] A. Radford et al., “Learning transferable visual models from natural
vol. 23, no. 1, pp. 537–547, Jan. 2022. language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–
[142] I. Sobh et al., “End-to-end multi-modal sensors fusion system for urban 8763.
automated driving,” in Proc. Int. Conf. Neural Inf. Process. Syst. Work- [166] OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774.
shops, 2018. [167] H. Touvron et al., “LLaMA: Open and efficient foundation language
[143] Y. Chen et al., “LiDAR-video driving dataset: Learning driving policies models,” 2023, arXiv:2302.13971.
effectively,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, [168] J. Mao, Y. Qian, H. Zhao, and Y. Wang, “GPT-driver: Learning to drive
pp. 5870–5878. with GPT,” 2023, arXiv:2310.01415.
[144] H. M. Eraqi, M. N. Moustafa, and J. Honer, “Dynamic conditional [169] Z. Xu et al., “DriveGPT4: Interpretable end-to-end autonomous driving
imitation learning for autonomous driving,” IEEE Trans. Intell. Transp. via large language model,” 2023, arXiv:2310.01412.
Syst., vol. 23, no. 12, pp. 22988–23001, Dec. 2022. [170] H. Shao, Y. Hu, L. Wang, S. L. Waslander, Y. Liu, and H. Li, “LMDrive:
[145] S. Chowdhuri, T. Pankaj, and K. Zipser, “Multinet: Multi-modal multi- Closed-loop end-to-end driving with large language models,” in Proc.
task learning for autonomous driving,” in Proc. IEEE Winter Conf. Appl. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 15120–15130.
Comput. Vis., 2019, pp. 1496–1504. [171] C. Sima et al., “DriveLM: Driving with graph visual question answering,”
[146] P. Cai, S. Wang, Y. Sun, and M. Liu, “Probabilistic end-to-end vehicle 2023, arXiv:2312.14150.
navigation in complex dynamic environments with multimodal sensor [172] T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang, “Nuscenes-
fusion,” IEEE Robot. Automat. Lett., vol. 5, no. 3, pp. 4218–4224, QA: A multi-modal visual question answering benchmark for au-
Jul. 2020. tonomous driving scenario,” in Proc. AAAI Conf. Artif. Intell., 2024,
[147] Q. Zhang, M. Tang, R. Geng, F. Chen, R. Xin, and L. Wang, “MMFN: pp. 4542–4550.
Multi-modal-fusion-net for end-to-end driving,” in Proc. IEEE/RSJ Int. [173] Z. Yang, X. Jia, H. Li, and J. Yan, “A survey of large language models
Conf. Intell. Robots Syst., 2022, pp. 8638–8643. for autonomous driving,” 2023, arXiv:2311.01043.
[174] B. Hilleli and R. El-Yaniv, “Toward deep reinforcement learning without [200] M. Henaff, A. Canziani, and Y. LeCun, “Model-predictive policy learning
a simulator: An autonomous steering example,” in Proc. AAAI Conf. Artif. with uncertainty regularization for driving in dense traffic,” in Proc. Int.
Intell., 2018, pp. 1471–1478. Conf. Learn. Representations, 2019.
[175] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [201] J. Wu, Z. Huang, and C. Lv, “Uncertainty-aware model-based reinforce-
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, ment learning: Methodology and application in autonomous driving,”
pp. 770–778. IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 194–203, Jan. 2022.
[176] Y. Lee, J.-W. Hwang, S. Lee, Y. Bae, and J. Park, “An energy and GPU- [202] M. Pan, X. Zhu, Y. Wang, and X. Yang, “Iso-dream: Isolating and
computation efficient backbone network for real-time object detection,” leveraging noncontrollable visual dynamics in world models,” in Proc.
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, Int. Conf. Neural Inf. Process. Syst., 2022, pp. 23178–23191.
pp. 752–760. [203] J. Yang et al., “Generalized predictive model for autonomous driv-
[177] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for ing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024,
image recognition at scale,” in Proc. Int. Conf. Learn. Representations, pp. 14662–14672.
2021. [204] Y. Wang, J. He, L. Fan, H. Li, Y. Chen, and Z. Zhang, “Driving into
[178] M. Dehghani et al., “Scaling vision transformers to 22 billion parame- the future: Multiview visual forecasting and planning with world model
ters,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 7480–7512. for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern
[179] H. Li et al., “Delving into the devils of bird’s-eye-view perception: A Recognit., 2024, pp. 14749–14759.
review, evaluation and recipe,” IEEE Trans. Pattern Anal. Mach. Intell., [205] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-
vol. 46, no. 4, pp. 2151–2170, Apr. 2024. resolution image synthesis with latent diffusion models,” in Proc. IEEE
[180] Z. Li et al., “BEVFormer: Learning bird’s-eye-view representation from Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10684–10695.
multi-camera images via spatiotemporal transformers,” in Proc. Eur. [206] A. Hu et al., “Model-based imitation learning for urban driving,” in Proc.
Conf. Comput. Vis., 2022, pp. 1–18. Int. Conf. Neural Inf. Process. Syst., 2022.
[181] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “DriveAdapter: [207] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, pp. 41–75,
Breaking the coupling barrier of perception and planning in end-to-end 1997.
autonomous driving,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, [208] K. Ishihara, A. Kanervisto, J. Miura, and V. Hautamaki, “Multi-task learn-
pp. 7953–7963. ing with attention for end-to-end autonomous driving,” in Proc. IEEE
[182] W. Tong et al., “Scene as occupancy,” in Proc. IEEE Int. Conf. Comput. Conf. Comput. Vis. Pattern Recognit. Workshops, 2021, pp. 2896–2905.
Vis., 2023, pp. 8406–8415. [209] Z. Li, T. Motoyoshi, K. Sasaki, T. Ogata, and S. Sugano, “Rethinking
[183] Q. Li, Y. Wang, Y. Wang, and H. Zhao, “HDMapNet: An online HD map self-driving: Multi-task knowledge for better generalization and accident
construction and evaluation framework,” in Proc. IEEE Int. Conf. Robot. explanation ability,” 2018, arXiv:1809.11100.
Automat., 2022, pp. 4628–4634. [210] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
[184] B. Liao et al., “MapTR: Structured modeling and learning for online vec- models from large-scale video datasets,” in Proc. IEEE Conf. Comput.
torized HD map construction,” in Proc. Int. Conf. Learn. Representations, Vis. Pattern Recognit., 2017, pp. 2174–2182.
2023. [211] A. Mehta, A. Subramanian, and A. Subramanian, “Learning end-to-end
[185] H. Wang et al., “Openlane-v2: A topology reasoning benchmark for autonomous driving using guided auxiliary supervision,” in Proc. 11th
unified 3D HD mapping,” in Proc. Int. Conf. Neural Inf. Process. Syst. Indian Conf. Comput. Vis. Graph. Image Process., 2018, Art. no. 11.
Datasets Benchmarks, 2023, pp. 18873–18884. [212] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning to steer by mimicking
[186] T. Li et al., “Topology reasoning for driving scenes,” 2023, features from heterogeneous auxiliary networks,” in Proc. AAAI Conf.
arXiv:2304.05277. Artif. Intell., 2019, pp. 8433–8440.
[187] T. Li et al., “Lanesegnet: Map learning with lane segment perception for [213] A. Zhao, T. He, Y. Liang, H. Huang, G. Van den Broeck, and S. Soatto,
autonomous driving,” in Proc. Int. Conf. Learn. Representations, 2024. “SAM: Squeeze-and-mimic networks for conditional visual driving pol-
[188] G. Wang, H. Niu, D. Zhu, J. Hu, X. Zhan, and G. Zhou, “A versatile icy learning,” in Proc. Conf. Robot Learn., 2020, pp. 156–175.
and efficient reinforcement learning framework for autonomous driving,” [214] É. Zablocki, H. Ben-Younes, P. Pérez, and M. Cord, “Explainability of
2021, arXiv:2110.11573. deep vision-based autonomous driving systems: Review and challenges,”
[189] A. Behl, K. Chitta, A. Prakash, E. Ohn-Bar, and A. Geiger, “Label Int. J. Comput. Vis., vol. 130, pp. 2425–2452, 2022.
efficient visual abstractions for autonomous driving,” in Proc. IEEE/RSJ [215] M. Bojarski et al., “Explaining how a deep neural network trained with
Int. Conf. Intell. Robots Syst., 2020, pp. 2338–2345. end-to-end learning steers a car,” 2017, arXiv:1704.07911.
[190] S.-H. Chung, S.-H. Kong, S. Cho, and I. M. A. Nahrendra, “Segmented [216] M. Bojarski et al., “VisualBackProp: Efficient visualization of CNNs for
encoding for Sim2Real of RL-based end-to-end autonomous driving,” in autonomous driving,” in Proc. IEEE Int. Conf. Robot. Automat., 2018,
Proc. IEEE Intell. Veh. Symp., 2022, pp. 1290–1296. pp. 4701–4708.
[191] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2013, [217] S. Mohseni, A. Jagadeesh, and Z. Wang, “Predicting model failure using
arXiv:1312.6114. saliency maps in autonomous driving systems,” 2019, arXiv:1905.07679.
[192] M. Ahmed, A. Abobakr, C. P. Lim, and S. Nahavandi, “Policy-based [218] J. Kim and J. Canny, “Interpretable learning for self-driving cars by
reinforcement learning for training autonomous driving agents in urban visualizing causal attention,” in Proc. IEEE Int. Conf. Comput. Vis., 2017,
areas with affordance learning,” IEEE Trans. Intell. Transp. Syst., vol. 23, pp. 2961–2969.
no. 8, pp. 12562–12571, Aug. 2022. [219] K. Mori, H. Fukui, T. Murase, T. Hirakawa, T. Yamashita, and H. Fu-
[193] A. Sauer, N. Savinov, and A. Geiger, “Conditional affordance learning jiyoshi, “Visual explanation by attention branch network for end-to-end
for driving in urban environments,” in Proc. Conf. Robot Learn., 2018, learning-based self-driving,” in Proc. IEEE Intell. Veh. Symp., 2019,
pp. 237–252. pp. 1577–1582.
[194] X. Zhang, M. Wu, H. Ma, T. Hu, and J. Yuan, “Multi-task long-range ur- [220] D. Wang, C. Devin, Q.-Z. Cai, F. Yu, and T. Darrell, “Deep object-centric
ban driving based on hierarchical planning and reinforcement learning,” policies for autonomous driving,” in Proc. IEEE Int. Conf. Robot. Au-
in Proc. IEEE Int. Intell. Transp. Syst. Conf., 2021, pp. 726–733. tomat., 2019, pp. 8853–8859.
[195] C. Huang et al., “Deductive reinforcement learning for visual autonomous [221] L. Cultrera, L. Seidenari, F. Becattini, P. Pala, and A. Del Bimbo,
urban driving navigation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, “Explaining autonomous driving by learning end-to-end visual attention,”
no. 12, pp. 5379–5391, Dec. 2021. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2020,
[196] R. Cheng, C. Agia, F. Shkurti, D. Meger, and G. Dudek, “Latent attention pp. 1389–1398.
augmentation for robust autonomous driving policies,” in Proc. IEEE/RSJ [222] Y. Xiao, F. Codevilla, D. P. Bustamante, and A. M. Lopez, “Scaling
Int. Conf. Intell. Robots Syst., 2021, pp. 130–136. self-supervised end-to-end driving with multi-view attention learning,”
[197] J. Yamada, K. Pertsch, A. Gunjal, and J. J. Lim, “Task-induced represen- 2023, arXiv:2302.03198.
tation learning,” in Proc. Int. Conf. Learn. Representations, 2022. [223] K. Renz, K. Chitta, O.-B. Mercea, A. S. Koepke, Z. Akata, and A. Geiger,
[198] J. Chen and S. Pan, “Learning generalizable representations for rein- “Plant: Explainable planning transformers via object-level representa-
forcement learning via adaptive meta-learner of behavioral similarities,” tions,” in Proc. Conf. Robot Learn., 2022, pp. 459–470.
in Proc. Int. Conf. Learn. Representations, 2022. [224] Y. Sun, X. Wang, Y. Zhang, J. Tang, X. Tang, and J. Yao, “In-
[199] Z. Yang, L. Chen, Y. Sun, and H. Li, “Visual point cloud forecasting terpretable end-to-end driving model for implicit scene understand-
enables scalable autonomous driving,” in Proc. IEEE Conf. Comput. Vis. ing,” in Proc. IEEE 26th Int. Conf. Intell. Transp. Syst., 2023,
Pattern Recognit., 2024, pp. 14673–14684. pp. 2874–2880.
[225] C. Liu, Y. Chen, M. Liu, and B. E. Shi, “Using eye gaze to en- [250] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without
hance generalization of imitation networks to unseen environments,” forgetting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 5, pp. 2066–2074, pp. 4367–4375.
May 2021. [251] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond
[226] W. Zeng, S. Wang, R. Liao, Y. Chen, B. Yang, and R. Urtasun, “DSDNET: empirical risk minimization,” in Proc. Int. Conf. Learn. Representations,
Deep structured self-driving network,” in Proc. Eur. Conf. Comput. Vis., 2017.
2020, pp. 156–172. [252] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
[227] A. Cui, S. Casas, A. Sadat, R. Liao, and R. Urtasun, “Lookout: Diverse dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017,
multi-future prediction and planning for self-driving,” in Proc. IEEE Int. pp. 2980–2988.
Conf. Comput. Vis., 2021, pp. 16107–16116. [253] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss
[228] Y. Xu et al., “Explainable object-induced action decision for autonomous based on effective number of samples,” in Proc. IEEE Conf. Comput. Vis.
vehicles,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, Pattern Recognit., 2019, pp. 9268–9277.
pp. 9523–9532. [254] S. Akhauri, L. Y. Zheng, and M. C. Lin, “Enhanced transfer learning
[229] H. Ben-Younes, É. Zablocki, P. Pérez, and M. Cord, “Driving behavior for autonomous driving with systematic accident simulation,” in Proc.
explanation with multi-level fusion,” Pattern Recognit., vol. 123, 2022, IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 5986–5993.
Art. no. 108421. [255] Q. Li, Z. Peng, Q. Zhang, C. Liu, and B. Zhou, “Improving the gen-
[230] B. Jin et al., “Adapt: Action-aware driving caption transformer,” in Proc. eralization of end-to-end driving through procedural generation,” 2020,
IEEE Int. Conf. Robot. Automat., 2023, pp. 7554–7561. arXiv:2012.13681.
[231] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration [256] P. A. Lopez et al., “Microscopic traffic simulation using SUMO,” in Proc.
of modern neural networks,” in Proc. Int. Conf. Mach. Learn., 2017, 21st Int. Conf. Intell. Transp. Syst., 2018, pp. 2575–2582.
pp. 1321–1330. [257] M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, “Scal-
[232] A. Loquercio, M. Segu, and D. Scaramuzza, “A general framework for able end-to-end autonomous vehicle testing via rare-event simulation,”
uncertainty estimation in deep learning,” IEEE Robot. Automat. Lett., in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 9849–9860.
vol. 5, no. 2, pp. 3153–3160, Apr. 2020. [258] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, “Generating adver-
[233] R. Michelmore, M. Kwiatkowska, and Y. Gal, “Evaluating uncer- sarial driving scenarios in high-fidelity simulators,” in Proc. IEEE Int.
tainty quantification in end-to-end autonomous driving control,” 2018, Conf. Robot. Automat., 2019, pp. 8271–8277.
arXiv:1811.06817. [259] W. Ding, B. Chen, B. Li, K. J. Eun, and D. Zhao, “Multimodal safety-
[234] A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, and Y. critical scenarios generation for decision-making algorithms evaluation,”
Gal, “Can autonomous vehicles identify, recover from, and adapt IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1551–1558, Apr. 2021.
to distribution shifts?,” in Proc. Int. Conf. Mach. Learn., 2020, [260] L. Zhang, Z. Peng, Q. Li, and B. Zhou, “CAT: Closed-loop adversarial
pp. 3145–3153. training for safe end-to-end driving,” in Proc. Conf. Robot Learn., 2023,
[235] L. Tai, P. Yun, Y. Chen, C. Liu, H. Ye, and M. Liu, “Visual-based pp. 2357–2372.
autonomous driving deployment from a stochastic and uncertainty-aware [261] L. T. Triess, M. Dreissig, C. B. Rist, and J. M. Zöllner, “A survey on
perspective,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, deep domain adaptation for LiDAR perception,” in Proc. IEEE Intell.
pp. 2622–2628. Veh. Symp. Workshops, 2021, pp. 350–357.
[236] P. Cai, Y. Sun, H. Wang, and M. Liu, “VTGNet: A vision-based trajectory [262] Y. You, X. Pan, Z. Wang, and C. Lu, “Virtual to real reinforcement
generation network for autonomous vehicles in urban environments,” learning for autonomous driving,” in Proc. Brit. Mach. Vis. Conf., 2017.
IEEE Trans. Intell. Veh., vol. 6, no. 3, pp. 419–429, Sep. 2021. [263] A. Bewley et al., “Learning to drive from simulation without real world
[237] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model labels,” in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 4818–4824.
of safe and scalable self-driving cars,” 2017, arXiv:1708.06374. [264] J. Xing, T. Nagata, K. Chen, X. Zou, E. Neftci, and J. L. Krichmar,
[238] T. Brüdigam, M. Olbrich, D. Wollherr, and M. Leibold, “Stochastic model “Domain adaptation in reinforcement learning via latent unified state rep-
predictive control with a safety guarantee for automated driving,” IEEE resentation,” in Proc. AAAI Conf. Artif. Intell., 2021, pp. 10452–10459.
Trans. Intell. Veh., vol. 8, no. 1, pp. 22–36, Jan. 2023. [265] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
[239] Y. Lyu, W. Luo, and J. M. Dolan, “Probabilistic safety-assured adaptive “Domain randomization for transferring deep neural networks from
merging control for autonomous vehicles,” in Proc. IEEE Int. Conf. simulation to the real world,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots
Robot. Automat., 2021, pp. 10764–10770. Syst., 2017, pp. 23–30.
[240] J. P. Allamaa, P. Patrinos, T. Ohtsuka, and T. D. Son, “Real-time MPC with [266] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real
control barrier functions for autonomous driving using safety enhanced transfer of robotic control with dynamics randomization,” in Proc. IEEE
collocation,” 2024, arXiv:2401.06648. Int. Conf. Robot. Automat., 2018, pp. 3803–3810.
[241] R. Geirhos et al., “Shortcut learning in deep neural networks,” Nature [267] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learn-
Mach. Intell., vol. 2, pp. 665–673, 2020. ing for deformable object manipulation,” in Proc. Conf. Robot Learn.,
[242] P. de Haan, D. Jayaraman, and S. Levine, “Causal confusion in imi- 2018, pp. 734–743.
tation learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, [268] B. Osiński et al., “Simulation-based reinforcement learning for real-world
pp. 11698–11709. autonomous driving,” in Proc. IEEE Int. Conf. Robot. Automat., 2020,
[243] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. LeCun, “Off-road obstacle pp. 6411–6418.
avoidance through end-to-end learning,” in Proc. Int. Conf. Neural Inf. [269] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel, “A survey of
Process. Syst., 2005, pp. 739–746. zero-shot generalisation in deep reinforcement learning,” J. Artif. Intell.
[244] M. Bansal, A. Krizhevsky, and A. S. Ogale, “ChauffeurNet: Learning to Res., vol. 76, pp. 201–264, 2023.
drive by imitating the best and synthesizing the worst,” Robotics: Sci. [270] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot
Syst. Conf, 2019. learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 4080–
[245] C. Chuang, D. Yang, C. Wen, and Y. Gao, “Resolving copycat problems 4090.
in visual imitation learning via residual action prediction,” in Proc. Eur. [271] P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differ-
Conf. Comput. Vis., 2022, pp. 392–409. entiable and modular control stack for autonomous vehicles,” in Proc.
[246] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of Conf. Robot Learn., 2022, pp. 2170–2180.
the class imbalance problem in convolutional neural networks,” Neural [272] H. Li et al., “Open-sourced data ecosystem in autonomous driving: The
Netw., vol. 106, pp. 249–259, 2018. present and future,” 2023, arXiv:2312.03408.
[247] J. Byrd and Z. Lipton, “What is the effect of importance weighting in [273] A. Kirillov et al., “Segment anything,” in Proc. IEEE Int. Conf. Comput.
deep learning?,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 872–881. Vis., 2023, pp. 4015–4026.
[248] I. Mani and I. Zhang, “KNN approach to unbalanced data distributions: [274] S. Narang and A. Chowdhery, “Pathways language model (PaLM): Scal-
A case study involving information extraction,” in Proc. Int. Conf. Mach. ing to 540 billion parameters for breakthrough performance,” J. Mach.
Learn. Workshops, 2003. Learn. Res., vol. 24, pp. 11324–11436, 2022.
[249] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersampling for class- [275] Y. Fang et al., “Exploring the limits of masked visual representation
imbalance learning,” IEEE Trans. Syst., Man, Cybern. B Cybern., vol. 39, learning at scale,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
no. 2, pp. 539–550, Apr. 2009. 2023, pp. 19358–19369.
[276] M. Oquab et al., “DINOv2: Learning robust visual features without supervision,” Trans. Mach. Learn. Res., 2024.
[277] J.-B. Alayrac et al., “Flamingo: A visual language model for few-shot learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 23716–23736.
[278] L. Ouyang et al., “Training language models to follow instructions with human feedback,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 27730–27744.
[279] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16000–16009.

Bernhard Jaeger received the BSc degree in informatics: Games engineering from the Technical University of Munich, in 2018, and the MSc degree in computer science from the University of Tübingen, in 2021. He is currently working toward the PhD degree with the Autonomous Vision Group led by Prof. Andreas Geiger, part of the University of Tübingen and Tübingen AI Center, Germany. His research interests include most aspects of embodied intelligence such as vision and decision making, with a focus on autonomous driving.