A Survey For Foundation Models in Autonomous Driving
Haoxiang Gao1 , Zhongruo Wang2 , Yaqian Li3 , Kaiwen Long3 ,
Ming Yang4 , Yiqing Shen5*
1 Motional AD LLC.
2 Amazon.
3 Li Auto Inc.
Abstract
The advent of foundation models has revolutionized the fields of natural lan-
guage processing and computer vision, paving the way for their application in
autonomous driving (AD). This survey presents a comprehensive review of more
than 40 research papers, demonstrating the role of foundation models in enhanc-
ing AD. Large language models contribute to planning and simulation in AD,
particularly through their proficiency in reasoning, code generation and transla-
tion. In parallel, vision foundation models are increasingly adapted for critical
tasks such as 3D object detection and tracking, as well as creating realistic driving
scenarios for simulation and testing. Multi-modal foundation models, integrating
diverse inputs, exhibit exceptional visual understanding and spatial reasoning,
crucial for end-to-end AD. This survey not only provides a structured taxon-
omy, categorizing foundation models based on their modalities and functionalities
within the AD domain but also delves into the methods employed in current
research. It identifies the gaps between existing foundation models and cutting-
edge AD approaches, thereby charting future research directions and proposing
a roadmap for bridging these gaps.
1 Introduction
The integration of deep learning (DL) into autonomous driving (AD) has marked
a significant leap in this field, attracting attention from both academic and indus-
trial spheres. AD systems, equipped with cameras and lidars, mimic human-like
decision-making processes. These systems are fundamentally composed of three key
components: perception, prediction, and planning. Perception, utilizing DL and com-
puter vision algorithms, focuses on object detection and tracking. Prediction forecasts
the behavior of traffic agents and their interaction with autonomous vehicles. Plan-
ning, which is typically structured hierarchically, involves making strategic driving
decisions, calculating optimal trajectories, and executing vehicle control commands.
The advent of foundation models, particularly renowned in natural language process-
ing and computer vision, has introduced new dimensions to AD research. These models
are distinct due to their training on extensive web-scale datasets and their massive
parameter sizes. Given the vast amounts of data generated by autonomous vehicle ser-
vices and advancements in AI, including NLP and AI-generated content (AIGC), there
is a growing curiosity about the potential of foundation models in AD. These models
could be instrumental in performing a range of AD tasks, such as object detection,
scene understanding, and decision-making, with a level of intelligence akin to human
drivers.
Foundation models address several challenges in AD. Conventionally, AD models
are trained in a supervised manner, dependent on manually annotated data that often
lack diversity, limiting their adaptability. Foundation models, however, show superior
generalization capabilities due to their training on diverse, web-scale data. They can
potentially replace complex heuristic rule-based planning systems, whose hand-crafted rules require substantial engineering effort to implement in software and to debug on corner cases, by drawing on the reasoning capabilities and common-sense driving knowledge acquired during extensive pre-training. Generative models within this domain can create
realistic traffic scenarios for simulation, essential for testing safety and reliability in
rare or challenging situations. Moreover, foundation models contribute to making AD
technology more user-centric, with language models understanding and executing user
commands in natural language.
Despite considerable research in applying foundation models to AD, there are
notable limitations and gaps in real-world application. Our survey aims to provide
a systematic review and propose future research directions. There are two surveys
related to foundation models for autonomous driving: LLM4Drive [1] is more focused
on large language models. [2] has a good breadth of summary of applications of foun-
dation models in autonomous driving, mainly in simulation, data annotation, and
planning. We expand upon existing surveys by covering vision foundation models and
multi-modal foundation models, analyzing their applications in prediction and percep-
tion tasks. This comprehensive approach includes detailed examinations of technical
aspects, such as pre-trained models and methods, and identifies future research oppor-
tunities. Innovatively, we propose a taxonomy categorizing foundation models in AD
based on modalities and functions, as shown in Figure 1.
Fig. 1 Taxonomy of foundation models for autonomous driving. It delineates the categorization of foundation models according to their modalities, such as Large Language Models, Vision Foundation Models, and Multi-modal Foundation Models, and correlates them with their respective functions in autonomous driving.
In the following sections, we will explore the application of various foundation models, including large language models, vision foundation models, and multi-modal foundation models, in the context of AD.
attributed to training on extensive datasets. Later GPT models, including ChatGPT and GPT-4 [5], are trained with billions of parameters on crawled web data containing trillions of words, and achieve strong performance on many NLP tasks, including translation, text summarization, and question answering. They also demonstrate one-shot and few-shot reasoning capabilities, learning new skills from context. A growing number of researchers have started to apply these reasoning, understanding, and in-context learning capabilities to address challenges in AD.
2.2 Applications in AD
2.2.1 Reasoning and Planning
The decision-making process in AD closely parallels human reasoning, necessitat-
ing the interpretation of environmental cues to make safe and comfortable driving
decisions. LLMs, through their training on diverse web data, have assimilated common-
sense knowledge pertinent to driving, drawing from a plethora of sources including
web forums and official government websites. This wealth of information enables LLMs
to engage in the nuanced decision-making required for AD. One method for harness-
ing LLMs in AD involves presenting them with detailed textual descriptions of the
driving environment, prompting them to propose driving decisions or control com-
mands. This process, as illustrated in Figure 2, typically encompasses comprehensive
prompts detailing agent states (such as coordinates, speed, and past trajectories), the ego vehicle's state (velocity and acceleration), and map specifics (traffic lights, lane information, and intended route). For enhanced interaction understanding, LLMs
can also be directed to provide reasoning along with their responses. For instance, the
GPT-Driver [6] not only recommends vehicle actions but also elucidates the rationale behind these suggestions, significantly enhancing the transparency and explainability of autonomous driving decisions; Driving with LLMs [7] follows a similar approach. Likewise, the “Receive,
Reason, and React” approach [8] instructs LLM agents to assess lane occupancy and
evaluate the safety of potential actions, thereby fostering a deeper comprehension of
dynamic driving scenarios. These methods not only leverage LLMs’ inherent abil-
ity to understand complex scenarios but also employ their reasoning capabilities to
simulate human-like decision-making processes. Through the integration of detailed
environmental descriptions and strategic prompts, LLMs contribute significantly to
the planning and reasoning aspects of AD, offering insights and decisions that mirror
human judgment and expertise.
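To make the pattern in Figure 2 concrete, the sketch below serializes the scene to text and asks a chat LLM for an action plus a one-sentence rationale. It is a minimal illustration of the prompting pipeline, not the exact setup of GPT-Driver [6] or the Receive, Reason, and React framework [8]; the prompt schema and the use of the OpenAI Python client are assumptions.

```python
# Minimal sketch of the "describe the scene in text, ask the LLM for a decision"
# pattern in Figure 2. Prompt fields and the OpenAI client are illustrative
# assumptions, not the schema used by any specific paper surveyed here.
from openai import OpenAI


def describe_scene(ego, agents, map_info):
    """Render ego state, surrounding agents, and map context as plain text."""
    lines = [f"Ego vehicle: speed={ego['speed']:.1f} m/s, accel={ego['accel']:.1f} m/s^2"]
    for a in agents:
        lines.append(
            f"Agent {a['id']} ({a['type']}): position={a['xy']}, "
            f"speed={a['speed']:.1f} m/s, past trajectory={a['history']}"
        )
    lines.append(
        f"Map: lane={map_info['lane']}, traffic light={map_info['light']}, "
        f"route={map_info['route']}"
    )
    return "\n".join(lines)


def propose_action(ego, agents, map_info, model="gpt-4o"):
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    prompt = (
        "You are a careful autonomous-driving planner.\n"
        f"{describe_scene(ego, agents, map_info)}\n"
        "Propose one high-level action (e.g. 'yield', 'keep lane at 10 m/s') "
        "and explain the reasoning in one sentence."
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content
```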
2.2.2 Prediction
Prediction forecasts traffic participants’ future trajectories, intents, and possible inter-
actions with the ego vehicle. Common deep learning models operate on rasterized or vectorized representations of the traffic scene, which encode spatial information. However, it is still challenging to accurately predict highly interactive scenes, which require reasoning over semantic cues such as right-of-way, vehicles' turn signals, and pedestrians' gestures. A text representation of the scene can provide richer semantic information and better leverage the LLM's reasoning capability and common knowledge from its pre-training data.
Fig. 2 Common pattern of LLM pipelines for autonomous driving, showcasing the integration of textual environment descriptions and LLM reasoning to inform driving decisions.
There is still little research applying LLMs
to trajectory prediction. An early exploration is [9], which converts the scene representation into text prompts, uses a BERT model to generate a text encoding, and fuses this encoding with an image encoding to decode trajectory predictions. Their evaluation shows significant improvement
compared with baselines only using image encoding or text encoding.
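The sketch below illustrates the fusion idea in [9]: a BERT text encoding of the scene description is concatenated with an image encoding and decoded into future waypoints. The architecture sizes, the toy CNN, and fusion by concatenation are illustrative assumptions, not the paper's exact model.

```python
# Simplified text+image fusion for trajectory prediction, in the spirit of [9].
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class TextImageTrajectoryPredictor(nn.Module):
    def __init__(self, horizon=12):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.image_encoder = nn.Sequential(  # toy CNN over a rasterized BEV image
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.decoder = nn.Sequential(  # fused features -> (x, y) per future step
            nn.Linear(768 + 64, 256), nn.ReLU(), nn.Linear(256, horizon * 2),
        )
        self.horizon = horizon

    def forward(self, scene_text, bev_image):
        tokens = self.tokenizer(scene_text, return_tensors="pt",
                                padding=True, truncation=True)
        text_feat = self.text_encoder(**tokens).pooler_output   # [B, 768]
        img_feat = self.image_encoder(bev_image)                 # [B, 64]
        fused = torch.cat([text_feat, img_feat], dim=-1)
        return self.decoder(fused).view(-1, self.horizon, 2)     # [B, T, 2]
```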
[13] uses an LLM as a powerful interpreter, translating the user's text query into structured specifications of map lanes and vehicle locations for traffic simulation scenarios.
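A minimal sketch of this "LLM as interpreter" pattern is shown below: a free-form request is converted into a machine-readable scenario specification. The JSON schema and the query_llm() helper are hypothetical placeholders, not the actual interface of [13].

```python
# Sketch: free-form user query -> structured traffic-scenario spec via an LLM.
import json

SCHEMA_HINT = """Return only JSON with fields:
  "map": {"num_lanes": int, "lane_width_m": float},
  "vehicles": [{"lane": int, "s_m": float, "speed_mps": float}]"""


def user_query_to_scenario(user_query, query_llm):
    """query_llm: callable(str) -> str, e.g. a thin wrapper around a chat LLM."""
    prompt = (f"Convert this request into a traffic scenario.\n"
              f"Request: {user_query}\n{SCHEMA_HINT}")
    raw = query_llm(prompt)
    spec = json.loads(raw)          # fail loudly if the LLM drifted from the schema
    assert spec["map"]["num_lanes"] >= 1
    return spec


# Example: a canned response standing in for a real LLM call.
fake_llm = lambda _: ('{"map": {"num_lanes": 3, "lane_width_m": 3.5}, '
                      '"vehicles": [{"lane": 1, "s_m": 20.0, "speed_mps": 8.0}]}')
print(user_query_to_scenario("a slow car ahead in the middle lane", fake_llm))
```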
[6] reaches a different conclusion: OpenAI fine-tuning performs significantly better than few-shot learning. [7] also compared training from scratch against fine-tuning and found that the pre-trained LLaMA model with LoRA-based fine-tuning performs better than training from scratch.
boost the inference speed of LLM-based decoder models, including model compression and knowledge distillation. New attention structures such as PagedAttention [22] have been proposed for fast inference in generative models. Quantization methods such as GPTQ [23], AWQ [24], and SqueezeLLM [25] achieve around 2.1x to 2.3x speedups over their baselines. Distillation methods such as SLidR [26] and ST-SLidR [27] have been proposed for LiDAR-style inputs on detection and segmentation tasks.
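To give intuition for why quantization helps, the toy example below applies plain round-to-nearest int8 quantization to a weight matrix. The real methods (GPTQ [23], AWQ [24], SqueezeLLM [25]) are considerably more sophisticated, using calibration data, activation-aware scaling, or sparse outlier handling; this shows only the baseline idea of trading a small accuracy loss for a 4x reduction in weight memory.

```python
# Toy per-row symmetric int8 weight quantization (not GPTQ/AWQ themselves).
import torch


def quantize_int8(weight: torch.Tensor):
    """Per-output-channel symmetric quantization of a [out, in] weight matrix."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per row
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale


w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
err = (dequantize(q, s) - w).abs().mean()
print(f"int8 storage: {q.numel()} bytes (4x smaller than fp32), mean abs error {err:.4f}")
```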
2.5 Summary
The LLM publications are summarized in Figure 3. We propose finer-grained classifications by environment (real or simulated), function in autonomous driving, foundation model, and technique used in the research.
Fig. 3 Summary of publications in LLM for autonomous driving.
Fig. 4 Stable Diffusion.
variable to pixel-wise images. It also has an optional text encoder and applies a cross-attention mechanism to generate images conditioned on prompts (text descriptions or other images).
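The block below is a minimal sketch of the cross-attention conditioning used by latent diffusion models such as Stable Diffusion [33]: flattened U-Net feature tokens attend to embeddings from the conditioning encoder. Dimensions are illustrative; the real U-Net interleaves many such blocks with convolutions and self-attention.

```python
# Cross-attention between latent (U-Net) tokens and text-conditioning tokens.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    def __init__(self, latent_dim=320, cond_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latent_tokens, cond_tokens):
        # latent_tokens: [B, H*W, latent_dim] flattened U-Net features
        # cond_tokens:   [B, L, cond_dim]     text-encoder output
        attended, _ = self.attn(query=self.norm(latent_tokens),
                                key=cond_tokens, value=cond_tokens)
        return latent_tokens + attended      # residual connection


block = CrossAttentionBlock()
out = block(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
print(out.shape)   # torch.Size([2, 4096, 320])
```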
The DALL-E [37] model, trained on a large corpus of image-text pairs, generates high-fidelity images and creative art following human instructions. DALL-E 2 [38], an extension of DALL-E [37], integrates a CLIP encoder with a diffusion decoder to handle both image generation and editing tasks. Unlike Imagen, DALL-E 2 utilizes a prior network to translate between text embeddings and image embeddings. Building on this, DALL-E 3 focuses on enhancing prompt adherence and caption quality: it first trains a robust image captioner capable of generating detailed and accurate image descriptions, and then uses this captioner to recaption the training data with richer detail. There is growing interest in the application of vision foundation models in autonomous driving, mainly for 3D perception and video generation tasks.
3.1 Perception
SAM3D [39] applies SAM (the Segment Anything Model) to 3D object detection in autonomous driving. Lidar point clouds are projected to BEV (bird's-eye-view) images, and a 32x32 mesh grid of point prompts is used to detect masks for foreground objects. The method leverages the SAM model's zero-shot transfer capability to generate segmentation masks and 2D boxes, then uses the vertical attributes of the lidar points inside each 2D box to generate 3D boxes. However, evaluation on the Waymo Open Dataset shows that the average-precision metrics are still far from existing state-of-the-art 3D object detection models. The authors observed that the SAM foundation model cannot handle sparse and noisy points well and often produces false negatives for distant objects.
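The sketch below illustrates the two geometric steps of this recipe: rasterizing lidar points to a BEV image and lifting a 2D box back to a 3D box using the height range of the points inside it. The segment_bev() call is a placeholder for the Segment Anything predictor with grid point prompts; grid size and resolution are illustrative, not SAM3D's exact settings.

```python
# Simplified BEV rasterization and 2D-to-3D box lifting, in the spirit of SAM3D [39].
import numpy as np


def points_to_bev(points, res=0.1, x_range=(-50, 50), y_range=(-50, 50)):
    """points: [N, 3] lidar (x, y, z). Returns a binary BEV occupancy image."""
    h = int((x_range[1] - x_range[0]) / res)
    w = int((y_range[1] - y_range[0]) / res)
    ix = ((points[:, 0] - x_range[0]) / res).astype(int)
    iy = ((points[:, 1] - y_range[0]) / res).astype(int)
    valid = (ix >= 0) & (ix < h) & (iy >= 0) & (iy < w)
    bev = np.zeros((h, w), dtype=np.float32)
    bev[ix[valid], iy[valid]] = 1.0
    return bev


def lift_box_to_3d(points, box_2d, res=0.1, x_range=(-50, 50), y_range=(-50, 50)):
    """box_2d: (row_min, col_min, row_max, col_max) in BEV pixels -> (x, y, z, l, w, h)."""
    x0 = x_range[0] + box_2d[0] * res; x1 = x_range[0] + box_2d[2] * res
    y0 = y_range[0] + box_2d[1] * res; y1 = y_range[0] + box_2d[3] * res
    inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
              (points[:, 1] >= y0) & (points[:, 1] <= y1))
    z = points[inside, 2]
    z0, z1 = (z.min(), z.max()) if z.size else (0.0, 0.0)
    return ((x0 + x1) / 2, (y0 + y1) / 2, (z0 + z1) / 2, x1 - x0, y1 - y0, z1 - z0)


# bev = points_to_bev(lidar)
# boxes_2d = segment_bev(bev)      # placeholder: SAM with a 32x32 grid of point prompts
# boxes_3d = [lift_box_to_3d(lidar, b) for b in boxes_2d]
```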
Fig. 5 Video generation pipeline for AD.
GAIA-1 [43] was developed by Wayve to generate realistic driving videos. The world model takes camera images, text descriptions, and vehicle control signals as input tokens and predicts the next frame. The paper uses the pre-trained DINO [28] model's embeddings and a cosine-similarity loss to distill more semantic knowledge into the image token embeddings. A video diffusion model [44] then decodes high-fidelity driving scenes from the predicted image tokens. The diffusion model is trained on two separate tasks, image generation and video generation: the image generation task helps the decoder produce high-quality images, while the video generation task uses temporal attention to produce temporally consistent frames. The generated videos follow high-level real-world constraints and exhibit realistic scene dynamics, such as objects' locations, interactions, traffic rules, and road structures. They also show diversity and creativity, producing realistic possible outcomes conditioned on different text descriptions and the ego vehicle's actions.
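The loss below sketches the cosine-similarity distillation described for GAIA-1 [43]: image-token embeddings are pushed towards the embeddings of a frozen pre-trained DINO [28] teacher. The projection head and shapes are illustrative assumptions.

```python
# Cosine-similarity distillation from a frozen DINO teacher to image tokens.
import torch.nn.functional as F


def dino_distill_loss(token_emb, dino_emb, proj):
    """token_emb: [B, N, D_tok] from the world-model tokenizer.
    dino_emb:  [B, N, D_dino] from the frozen DINO teacher (no gradients).
    proj:      nn.Linear mapping D_tok -> D_dino."""
    student = F.normalize(proj(token_emb), dim=-1)
    teacher = F.normalize(dino_emb.detach(), dim=-1)
    return 1.0 - (student * teacher).sum(dim=-1).mean()   # 1 - cosine similarity
```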
DriveDreamer [45] also uses a world model and a diffusion model to generate video for autonomous driving. In addition to images, text descriptions, and vehicle actions, the model takes more structured traffic information as input, such as HD maps and 3D object boxes, so that it can better capture the higher-level structural constraints of traffic scenes. Training proceeds in two stages: the first stage is video generation with a diffusion model conditioned on the structured traffic information, built on a pre-trained Stable Diffusion model [33] with frozen parameters. In the second stage, the model is trained on both future video prediction and action prediction tasks to better learn future prediction and the interactions between objects.
[46] built a point cloud-based world model that achieves state-of-the-art performance on point cloud forecasting tasks. They propose a VQ-VAE-like [47] tokenizer to represent 3D point clouds as latent BEV tokens and use discrete diffusion to forecast future point clouds given the past BEV tokens and the ego vehicle's action tokens.
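The snippet below is a minimal sketch of the VQ-VAE-style [47] quantization step such a tokenizer relies on: each continuous BEV feature is replaced by its nearest codebook vector, yielding discrete tokens for the diffusion model. Codebook size and feature dimension are illustrative, not the settings of [46].

```python
# Nearest-codebook vector quantization with a straight-through estimator.
import torch


def vector_quantize(features, codebook):
    """features: [B, N, D] continuous BEV features; codebook: [K, D] embeddings.
    Returns (token_ids [B, N], quantized [B, N, D])."""
    dist = torch.cdist(features, codebook.unsqueeze(0).expand(features.size(0), -1, -1))
    ids = dist.argmin(dim=-1)                     # nearest codebook entry per feature
    quantized = codebook[ids]
    # straight-through estimator so gradients flow back to the encoder
    quantized = features + (quantized - features).detach()
    return ids, quantized


ids, q = vector_quantize(torch.randn(2, 256, 64), torch.randn(1024, 64))
print(ids.shape, q.shape)   # torch.Size([2, 256]) torch.Size([2, 256, 64])
```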
4 Multi-modal Foundation Models
Multi-modal foundation models take input from multiple modalities, e.g., sound, images, and video, to perform more complex tasks such as generating text from images and analyzing and reasoning over visual inputs.
One of the most well-known multi-modal foundation models is CLIP [48]. The model is pre-trained with a contrastive objective: given noisy image-text pairs, it is trained to predict whether a given image and text form a correct pair by maximizing the cosine similarity between the image-encoder and text-encoder embeddings of matching pairs. CLIP shows zero-shot transfer ability to other computer vision tasks, such as image classification, predicting the correct text description of a class without supervised training.
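For reference, the sketch below shows CLIP's [48] symmetric contrastive loss: matching image-text pairs sit on the diagonal of the similarity matrix and are pulled together, while all other pairings in the batch are pushed apart. The encoders are abstracted away; only the loss is shown.

```python
# CLIP-style symmetric contrastive loss over a batch of matching pairs.
import torch
import torch.nn.functional as F


def clip_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: [B, D] embeddings of B matching image-text pairs."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature           # [B, B] cosine-similarity matrix
    targets = torch.arange(len(img))               # the diagonal holds the true pairs
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```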
Multi-modal foundation models such as LLaVA [49], LISA [50], and CogVLM [51] can serve as general-purpose visual AI agents, demonstrating strong performance on vision tasks such as object segmentation, detection, localization, and spatial reasoning. Video-LLaMA [52] can further perceive video and audio data, which may help autonomous vehicles better understand the world from temporal image and audio sequences.
Multi-modal foundation models are also used for robot learning, treating the robot's actions as an additional modality to create more general-purpose agents that can perform real-world tasks. DeepMind proposed a vision-language-action model [53] trained on text and images from the web that learns to output control commands to complete real-world object manipulation tasks.
By transferring general knowledge from large-scale pre-training datasets to autonomous driving, multi-modal foundation models can be used for object detection, visual understanding, and spatial reasoning, enabling more powerful applications in AD.
quality. Their model backbone uses pre-trained visual encoder and LLM weights following BLIP-2 [55].
Talk2BEV [56] proposes an innovative bird's-eye-view (BEV) representation of the scene that fuses visual and semantic information. The pipeline first generates the BEV map from image and lidar data, then uses general-purpose vision-language foundation models to add detailed text descriptions of cropped object images. A JSON text representation of the BEV map is then passed to a general-purpose LLM to perform visual QA covering spatial and visual reasoning tasks. The results show a good understanding of detailed instance attributes as well as the higher-level intent of objects, and the ability to provide free-form advice on the ego vehicle's actions.
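The example below illustrates the language-friendly BEV scene representation in this pipeline: each detected object becomes a JSON entry combining geometry from the BEV map with a vision-language caption, and the whole list is handed to an LLM for spatial and visual QA. The field names and the query_llm() call are assumptions for illustration, not Talk2BEV's exact format.

```python
# Illustrative JSON scene representation plus a QA prompt for a general LLM.
import json

bev_objects = [
    {"id": 1, "category": "vehicle", "center_m": [12.4, -3.1],
     "caption": "white SUV with brake lights on"},
    {"id": 2, "category": "pedestrian", "center_m": [8.0, 4.5],
     "caption": "person pushing a stroller near the crosswalk"},
]

question = "Is it safe to change into the right lane now?"
prompt = (
    "Scene objects (ego vehicle at the origin, x forward, y left):\n"
    f"{json.dumps(bev_objects, indent=2)}\n"
    f"Question: {question}\nAnswer with reasoning about distances and intent."
)
# answer = query_llm(prompt)   # any general-purpose LLM, as in this style of pipeline
```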
LiDAR-LLM [57] combines point cloud data with the advanced reasoning abilities of large language models to interpret real-world 3D environments, achieving excellent performance on 3D captioning, grounding, and QA tasks. The model employs a three-stage training strategy and a View-Aware Transformer (VAT) to align 3D data with text embeddings, enhancing spatial comprehension. Their examples show the model can understand traffic scenes and provide suggestions for autonomous driving planning tasks.
[58] focuses on the explainability of vehicle actions using a visual QA approach. They collected driving videos in simulated environments across five action categories (such as going straight and turning left) and used manually labeled explanations of the actions to train the model. The model was able to explain driving decisions based on road geometry and obstacle clearance. They find it promising to apply state-of-the-art multi-modal foundation models to generate structured explanations of vehicle actions.
LLaVA, it used the pre-trained CLIP [48] encoder and LLM weights and fine-tuned the model on an instruction-following dataset specifically designed for autonomous driving. They were able to build an end-to-end interpretable autonomous driving system, which has a good understanding of the surrounding environment and makes decisions on vehicle actions with justifications and lower-level control commands.
Fig. 6 Road map of foundation models in AD.
We also observe that data is one of the biggest obstacles to the future development of foundation models in autonomous driving. Existing open-source autonomous driving datasets [63], on the scale of 1,000 hours, are far smaller than the pre-training datasets used for state-of-the-art LLMs. The web data used for existing foundation models does not cover all modalities required by autonomous driving, such as lidar and surround cameras, and its domain is also quite different from real driving scenes.
We propose a longer-term road map in Figure 6. In the first stage, a large-scale 2D dataset covering the distribution, diversity, and complexity of real-world driving scenes can be collected for pre-training or fine-tuning; most vehicles can be equipped with front cameras to collect data in different cities and at various times of day. In the second stage, smaller but higher-quality 3D datasets with lidar can improve the foundation model's 3D perception and reasoning; for example, existing state-of-the-art 3D object detection models can serve as teachers to fine-tune the foundation model. Finally, human driving examples or planning and reasoning annotations can be leveraged for alignment, reaching the utmost safety goal of autonomous driving.
Conflict of Interest
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
Availability of Data
The data that support the findings of this study are available from the corresponding
author upon request.
Ethical Considerations
This study is a literature review and did not involve human participants or animal
subjects. Therefore, formal ethics approval was not required.
Funding Information
This research received no specific grant from any funding agency in the public,
commercial, or not-for-profit sectors.
Author Contributions
Y.Shen conceptualized the study. H. Gao, Z.Wang, Y. Li, and K. Long performed the
literature review and analysis. H. Gao and Y. Shen contributed to the interpretation
of the results. H. Gao and Y. Shen drafted the manuscript. M. Yang and Y. Shen
contributed to the revision of the manuscript and approved the final version.
References
[1] Yang, Z., Jia, X., Li, H., Yan, J.: LLM4Drive: A Survey of Large Language Models
for Autonomous Driving (2023)
[2] Huang, Y., Chen, Y., et al.: Applications of Large Scale Foundation Models for
Autonomous Driving (2023)
[3] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 (2018)
[4] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language
understanding with unsupervised learning (2018)
[5] Achiam, J., Adler, S., Agarwal, S., et al.: Gpt-4 technical report. arXiv preprint
arXiv:2303.08774 (2023)
[6] Mao, J., Qian, Y., Ye, J., Zhao, H., Wang, Y.: GPT-Driver: Learning to Drive
with GPT (2023)
[7] Chen, L., Sinavski, O., Hünermann, J., Karnsund, A., Willmott, A.J., Birch, D.,
Maund, D., Shotton, J.: Driving with LLMs: Fusing Object-Level Vector Modality
for Explainable Autonomous Driving (2023)
[8] Cui, C., Ma, Y., Cao, X., Ye, W., Wang, Z.: Receive, Reason, and React: Drive
as You Say with Large Language Models in Autonomous Vehicles (2023)
[9] Keysan, A., Look, A., Kosman, E., Gürsun, G., Wagner, J., Yao, Y., Rakitsch,
B.: Can you text what is happening? Integrating pre-trained language encoders
into trajectory prediction models for autonomous driving (2023)
[10] Yang, Y., Zhang, Q., Li, C., Marta, D.S., Batool, N., Folkesson, J.: Human-Centric
Autonomous Systems With LLMs for User Command Reasoning (2023)
[11] Wang, S., Sheng, Z., et al.: Adept: A testing platform for simulated autonomous
driving. In: Proceedings of the 37th IEEE/ACM International Conference on
Automated Software Engineering, pp. 1–4 (2022)
[12] Deng, Y., Yao, J., Tu, Z., Zheng, X., Zhang, M., Zhang, T.: TARGET: Automated
Scenario Generation from Traffic Rules for Testing Autonomous Vehicles (2023)
[13] Tan, S., Ivanovic, B., Weng, X., Pavone, M., Kraehenbuehl, P.: Language
Conditioned Traffic Generation (2023)
[14] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang,
C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow
instructions with human feedback. Advances in Neural Information Processing
Systems 35, 27730–27744 (2022)
[15] Mao, J., Ye, J., Qian, Y., Pavone, M., Wang, Y.: A Language Agent for
Autonomous Driving (2023)
[16] Sha, H., Mu, Y., et al.: LanguageMPC: Large Language Models as Decision
Makers for Autonomous Driving (2023)
[17] Wen, L., Fu, D., et al.: DiLu: A Knowledge-Driven Approach to Autonomous
Driving with Large Language Models (2023)
[18] Jin, Y., Shen, X., Peng, H., Liu, X., Qin, J., Li, J., Xie, J., Gao, P., Zhou, G., Gong,
J.: SurrealDriver: Designing Generative Driver Agent Simulation Framework in
Urban Contexts based on Large Language Model (2023)
[19] Wang, M., Zhang, Z., Yang, G.H.: Incorporating Voice Instructions in Model-
Based Reinforcement Learning for Self-Driving Cars (2022)
[20] Kaufmann, T., Weng, P., Bengs, V., Hüllermeier, E.: A survey of reinforcement
learning from human feedback. arXiv preprint arXiv:2312.14925 (2023)
[21] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.:
Direct preference optimization: Your language model is secretly a reward model.
Advances in Neural Information Processing Systems 36 (2024)
[22] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E.,
Zhang, H., Stoica, I.: Efficient memory management for large language model serv-
ing with pagedattention. In: Proceedings of the ACM SIGOPS 29th Symposium
on Operating Systems Principles (2023)
[23] Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate Post-
Training Quantization for Generative Pre-trained Transformers (2022)
[24] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G.,
Dang, X., Gan, C., Han, S.: AWQ: Activation-aware Weight Quantization for
LLM Compression and Acceleration (2023)
[25] Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M.W.,
Keutzer, K.: SqueezeLLM: Dense-and-Sparse Quantization (2023)
[26] Sautier, C., Puy, G., Gidaris, S., Boulch, A., Bursuc, A., Marlet, R.: Image-to-
lidar self-supervised distillation for autonomous driving data. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
9891–9901 (2022)
[27] Mahmoud, A., Hu, J.S., Kuai, T., Harakeh, A., Paull, L., Waslander, S.L.: Self-
supervised image-to-point distillation via semantically tolerant contrastive loss.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 7102–7110 (2023)
[28] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin,
A.: Emerging properties in self-supervised vision transformers. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660
(2021)
[29] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V.,
Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust
visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
[30] Kirillov, A., Mintun, E., et al.: Segment anything. arXiv preprint
arXiv:2304.02643 (2023)
[31] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsuper-
vised learning using nonequilibrium thermodynamics. In: International Confer-
ence on Machine Learning, pp. 2256–2265 (2015). PMLR
[32] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui,
B., Yang, M.-H.: Diffusion models: A comprehensive survey of methods and
applications. ACM Computing Surveys 56(4), 1–39 (2023)
[33] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 10684–10695
(2022)
[34] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances
in neural information processing systems 33, 6840–6851 (2020)
[35] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114 (2013)
[36] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: Medical Image Computing and Computer-Assisted
Intervention–MICCAI 2015: 18th International Conference, Munich, Germany,
October 5-9, 2015, Proceedings, Part III 18, pp. 234–241 (2015). Springer
[37] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M.,
Sutskever, I.: Zero-shot text-to-image generation. In: International Conference on
Machine Learning, pp. 8821–8831 (2021). PMLR
[38] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
1(2), 3 (2022)
[39] Zhang, D., Liang, D., et al.: SAM3D: Zero-Shot 3D Object Detection via Segment
Anything Model (2023)
[40] Peng, X., Chen, R., et al.: Learning to Adapt SAM for Segmenting Cross-domain
Point Clouds (2023)
[41] Liu, S., Zeng, Z., et al.: Grounding dino: Marrying dino with grounded pre-
training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
[42] Cheng, Y., Li, L., Xu, Y., Li, X., Yang, Z., Wang, W., Yang, Y.: Segment and
Track Anything (2023)
[43] Hu, A., Russell, L., et al.: GAIA-1: A Generative World Model for Autonomous
Driving (2023)
[44] Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video
Diffusion Models (2022)
[45] Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: DriveDreamer: Towards
Real-world-driven World Models for Autonomous Driving (2023)
[46] Zhang, L., Xiong, Y., Yang, Z., Casas, S., Hu, R., Urtasun, R.: Learning
Unsupervised World Models for Autonomous Driving via Discrete Diffusion
(2023)
[47] Oord, A.v.d., Vinyals, O., Kavukcuoglu, K.: Neural discrete representation
learning. arXiv preprint arXiv:1711.00937 (2017)
[48] Radford, A., Kim, J.W., et al.: Learning transferable visual models from natu-
ral language supervision. In: International Conference on Machine Learning, pp.
8748–8763 (2021). PMLR
[49] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
[50] Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning
segmentation via large language model. arXiv preprint arXiv:2308.00692 (2023)
[51] Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao,
L., Song, X., et al.: Cogvlm: Visual expert for pretrained language models. arXiv
preprint arXiv:2311.03079 (2023)
[52] Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual
language model for video understanding. arXiv preprint arXiv:2306.02858 (2023)
[53] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K.,
Ding, T., Driess, D., Dubey, A., Finn, C., et al.: Rt-2: Vision-language-action mod-
els transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818
(2023)
[54] Ding, X., Han, J., Xu, H., Zhang, W., Li, X.: HiLM-D: Towards High-Resolution
Understanding in Multimodal Large Language Models for Autonomous Driving
(2023)
[55] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre-
training with Frozen Image Encoders and Large Language Models (2023)
[57] Yang, S., Liu, J., Zhang, R., Pan, M., Guo, Z., Li, X., Chen, Z., Gao, P., Guo,
Y., Zhang, S.: Lidar-llm: Exploring the potential of large language models for 3d
lidar understanding. arXiv preprint arXiv:2312.14074 (2023)
[58] Atakishiyev, S., Salameh, M., et al.: Explaining autonomous driving actions with
visual question answering. arXiv preprint arXiv:2307.10408 (2023)
[59] Wen, L., Yang, X., Fu, D., Wang, X., Cai, P., Li, X., Ma, T., Li, Y., Xu, L.,
Shang, D., Zhu, Z., Sun, S., Bai, Y., Cai, X., Dou, M., Hu, S., Shi, B., Qiao, Y.:
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model
on Autonomous Driving (2023)
[60] Xu, Z., Zhang, Y., Xie, E., Zhao, Z., Guo, Y., Wong, K.K., Li, Z., Zhao,
H.: Drivegpt4: Interpretable end-to-end autonomous driving via large language
model. arXiv preprint arXiv:2310.01412 (2023)
[61] Reis, D., Kupec, J., Hong, J., Daoudi, A.: Real-Time Flying Object Detection
with YOLOv8 (2023)
[62] Kim, J., Rohrbach, A., Darrell, T., Canny, J., Akata, Z.: Textual explanations for
self-driving vehicles. In: Proceedings of the European Conference on Computer
Vision (ECCV), pp. 563–578 (2018)
[63] Li, H., Li, Y., et al.: Open-sourced Data Ecosystem in Autonomous Driving: the
Present and Future (2024)