SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

Cheng, An-Chieh; Yin, Hongxu; Fu, Yang; Guo, Qiushan; Yang, Ruihan; Kautz, Jan; Wang, Xiaolong; Liu, Sifei

arXiv:2406.01584 (cs)

[Submitted on 3 Jun 2024 (v1), last revised 15 Oct 2024 (this version, v3)]

Abstract: Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark are released at this https URL

Comments:	NeurIPS 2024, Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.01584 [cs.CV]
	(or arXiv:2406.01584v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.01584

Submission history

From: An-Chieh Cheng
[v1] Mon, 3 Jun 2024 17:59:06 UTC (4,931 KB)
[v2] Tue, 18 Jun 2024 21:24:46 UTC (4,924 KB)
[v3] Tue, 15 Oct 2024 01:16:20 UTC (5,187 KB)
