TL;DR: LIFT-GS trains a 3D vision-language grounding (3D VLG) model using only 2D supervision, via pixel-based losses and differentiable rendering.
LIFT-GS is an approach for training 3D vision-language understanding and grounding models that is supervised using only pixel-based losses and differentiable rendering. This makes it possible to train 3D models without 3D labels, and without placing constraints on the network architecture (e.g., it can be used with decoder-only transformer models). During training, LIFT-GS requires only images, 2D labels, and pointmaps (e.g., from depth and camera pose). Even the 2D labels can be removed and replaced with 2D pseudo-labels from pretrained models. LIFT-GS demonstrates this, using 2D pseudo-labels to pretrain a model for 3D vision-language understanding tasks. When finetuned for 3D vision-language grounding, LIFT-GS outperforms existing SotA models, and its pretraining outperforms other 3D pretraining techniques.
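To make the recipe concrete, here is a minimal sketch of what one render-supervised training step could look like in PyTorch. It is an illustration only, not the released implementation: the model, the rasterize function, and the batch fields are hypothetical placeholders standing in for any 3D backbone and any differentiable Gaussian Splatting rasterizer.

import torch
import torch.nn.functional as F

def training_step(model, rasterize, batch, optimizer):
    """One render-supervised step: 3D prediction -> 2D rendering -> pixel losses."""
    pointmap = batch["pointmap"]        # (N, 3) points lifted from depth + camera pose
    images = batch["images"]            # (V, 3, H, W) observed frames
    masks_2d = batch["pseudo_masks"]    # (V, Q, H, W) 2D labels or pseudo-labels
    cameras = batch["cameras"]          # per-frame intrinsics and extrinsics

    # The 3D model predicts Gaussians plus per-query mask logits from the pointmap
    # and the language queries; no 3D labels are involved anywhere.
    gaussians, mask_logits = model(pointmap, batch["queries"])

    # Differentiable rendering projects the 3D prediction back onto each frame.
    rgb_pred, mask_pred = rasterize(gaussians, mask_logits, cameras)

    # All supervision is frame-based: a photometric loss for reconstruction and
    # a mask loss for grounding/segmentation.
    loss = F.l1_loss(rgb_pred, images) \
        + F.binary_cross_entropy_with_logits(mask_pred, masks_2d)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()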
3D Referential Grounding with LIFT-GS: Given language queries and a sparse point cloud as input, LIFT-GS densely reconstructs the scene with Gaussian Splatting and grounds the queried nouns in the 3D scene.
Differentiable rendering can be used with many types of frame-based losses. LIFT-GS demonstrates that it can be used to train a single model for all three R's of computer vision: 3D reconstruction, open-vocabulary recognition, and 3D segmentation (reorganization)—all without any 3D supervision.
Recent works show that even high-resolution image-to-Gaussian models can be trained/finetuned directly from images (L4GM, AVAT3R). LIFT-GS demonstrates that differentiable rendering can be used to train models not just for reconstruction, but also models that predict and ground 3D instance masks using open-vocabulary referring expressions.
Since 3D mask data is scarce, LIFT-GS leverages foundation-scale 2D models to generate pseudo-labels directly on the observed frames; these 2D pseudo-labels are what supervise the 3D model. During training, the model's outputs are rendered to 2D feature maps and masks via Gaussian Splatting, and the pseudo-labels provide frame-based supervision. This render-supervised distillation approach is largely agnostic to both architecture and task: it can be used to train any 3D model whose outputs are renderable to 2D.
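As an illustration, the sketch below shows how rendered feature maps and masks could be compared against per-frame pseudo-labels. The tensor names and the specific loss mix (cosine feature distillation plus BCE and Dice on masks) are assumptions made for exposition, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def render_supervised_losses(feat_render, feat_pseudo, mask_render, mask_pseudo):
    """Frame-based distillation losses on rendered features and masks.

    feat_render / feat_pseudo: (V, C, H, W) rendered vs. 2D-teacher feature maps.
    mask_render / mask_pseudo: (V, Q, H, W) rendered mask logits vs. 2D pseudo-masks.
    """
    # Feature distillation: pull rendered per-pixel features toward the
    # 2D foundation model's features (cosine distance is one common choice).
    feat_loss = 1.0 - F.cosine_similarity(feat_render, feat_pseudo, dim=1).mean()

    # Mask supervision: standard 2D segmentation losses (BCE + Dice) applied
    # to the rendered masks, with the pseudo-masks as targets.
    bce = F.binary_cross_entropy_with_logits(mask_render, mask_pseudo)
    prob = mask_render.sigmoid()
    inter = (prob * mask_pseudo).sum(dim=(-2, -1))
    union = prob.sum(dim=(-2, -1)) + mask_pseudo.sum(dim=(-2, -1))
    dice = 1.0 - ((2.0 * inter + 1e-6) / (union + 1e-6)).mean()

    return feat_loss + bce + dice

In practice, the pseudo-masks might come from a promptable 2D segmenter and the teacher features from a vision-language encoder; the key point is that every supervision signal lives purely in 2D.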
After finetuning the distilled weights on existing 3D labels, LIFT-GS significantly outperforms both its non-distilled counterpart and SotA baselines. In the figure below, all pretraining and finetuning is done on the same scenes.
LIFT-GS exhibits robust scaling properties.
Our experiments demonstrate a clear "dataset multiplier" effect, where pretraining effectively amplifies the value of finetuning data, consistent with established scaling laws for transfer learning.
Importantly, these gains do not diminish even as the amount of finetuning data increases to 100%, indicating that 3D VLG models are currently operating in a severely data-constrained regime.
This suggests that using render-supervision along with foundation-scale image/video data offers a promising approach to scaling 3D vision-language models.
Moreover, our pipeline allows flexible use of 2D foundation models and pseudo-labeling strategies.
@article{liftgs2025,
author = {Cao, Ang and Arnaud, Sergio and Maksymets, Oleksandr and Yang, Jianing and Jain, Ayush and Yenamandra, Sriram and Martin, Ada and Berges, Vincent-Pierre and McVay, Paul and Partsey, Ruslan and Rajeswaran, Aravind and Meier, Franziska and Johnson, Justin and Park, Jeong Joon and Sax, Alexander},
title = {LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding},
year = {2025},
}