TL;DR: LIFT-GS trains a 3D vision-language grounding (3D VLG) model using only 2D supervision, via pixel-based losses and differentiable rendering.
LIFT-GS is an approach for training 3D vision-language understanding and grounding models that is supervised using only pixel-based losses and differentiable rendering. This makes it possible to train 3D models without 3D labels, and without placing constraints on the network architecture (e.g., it can be used with decoder-only transformer models). During training, LIFT-GS requires only images, 2D labels, and pointmaps (e.g., from depth and camera pose). Even the 2D labels can be removed and replaced with 2D pseudo-labels from pretrained models. LIFT-GS demonstrates this, using 2D pseudo-labels to pretrain a model for 3D vision-language understanding tasks. When finetuned for 3D vision-language grounding, LIFT-GS outperforms existing SotA models, and its pretraining outperforms other 3D pretraining techniques.
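To make the recipe concrete, here is a minimal sketch of what one render-supervised training step could look like in PyTorch. It is an illustration only, not the released implementation: the model, the rasterize function, and the batch fields are hypothetical placeholders standing in for any 3D backbone and any differentiable Gaussian Splatting rasterizer.

import torch
import torch.nn.functional as F

def training_step(model, rasterize, batch, optimizer):
    """One render-supervised step: 3D prediction -> 2D rendering -> pixel losses."""
    pointmap = batch["pointmap"]        # (N, 3) points lifted from depth + camera pose
    images = batch["images"]            # (V, 3, H, W) observed frames
    masks_2d = batch["pseudo_masks"]    # (V, Q, H, W) 2D labels or pseudo-labels
    cameras = batch["cameras"]          # per-frame intrinsics and extrinsics

    # The 3D model predicts Gaussians plus per-query mask logits from the pointmap
    # and the language queries; no 3D labels are involved anywhere.
    gaussians, mask_logits = model(pointmap, batch["queries"])

    # Differentiable rendering projects the 3D prediction back onto each frame.
    rgb_pred, mask_pred = rasterize(gaussians, mask_logits, cameras)

    # All supervision is frame-based: a photometric loss for reconstruction and
    # a mask loss for grounding/segmentation.
    loss = F.l1_loss(rgb_pred, images) \
        + F.binary_cross_entropy_with_logits(mask_pred, masks_2d)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()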
3D Referential Grounding with LIFT-GS: Given language queries and a sparse point cloud as input, LIFT-GS densely reconstructs the scene with Gaussian Splatting and grounds the queried nouns in the 3D scene.
Differentiable rendering can be used with many types of frame-based losses. LIFT-GS demonstrates that it can be used to train a single model for all three R's of computer vision: 3D reconstruction, open-vocabulary recognition, and 3D segmentation (reorganization)—all without any 3D supervision.
Recent works show that even high-resolution image-to-Gaussian models can be trained/finetuned directly from images (L4GM, AVAT3R). LIFT-GS demonstrates that differentiable rendering can be used to train models not just for reconstruction, but also models that predict and ground 3D instance masks using open-vocabulary referring expressions.
Since 3D mask data is scarce, LIFT-GS leverages foundation-scale 2D models to generate pseudo-labels directly on the observed frames; these 2D pseudo-labels are what supervise the 3D model. During training, the model's outputs are rendered to 2D feature maps and masks via Gaussian Splatting, and the pseudo-labels provide frame-based supervision. This render-supervised distillation approach is largely agnostic to both architecture and task: it can be used to train any 3D model whose outputs are renderable to 2D.
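As an illustration, the sketch below shows how rendered feature maps and masks could be compared against per-frame pseudo-labels. The tensor names and the specific loss mix (cosine feature distillation plus BCE and Dice on masks) are assumptions made for exposition, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def render_supervised_losses(feat_render, feat_pseudo, mask_render, mask_pseudo):
    """Frame-based distillation losses on rendered features and masks.

    feat_render / feat_pseudo: (V, C, H, W) rendered vs. 2D-teacher feature maps.
    mask_render / mask_pseudo: (V, Q, H, W) rendered mask logits vs. 2D pseudo-masks.
    """
    # Feature distillation: pull rendered per-pixel features toward the
    # 2D foundation model's features (cosine distance is one common choice).
    feat_loss = 1.0 - F.cosine_similarity(feat_render, feat_pseudo, dim=1).mean()

    # Mask supervision: standard 2D segmentation losses (BCE + Dice) applied
    # to the rendered masks, with the pseudo-masks as targets.
    bce = F.binary_cross_entropy_with_logits(mask_render, mask_pseudo)
    prob = mask_render.sigmoid()
    inter = (prob * mask_pseudo).sum(dim=(-2, -1))
    union = prob.sum(dim=(-2, -1)) + mask_pseudo.sum(dim=(-2, -1))
    dice = 1.0 - ((2.0 * inter + 1e-6) / (union + 1e-6)).mean()

    return feat_loss + bce + dice

In practice, the pseudo-masks might come from a promptable 2D segmenter and the teacher features from a vision-language encoder; the key point is that every supervision signal lives purely in 2D.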
After finetuning the distilled weights on existing 3D labels, LIFT-GS significantly outperforms both its non-distilled counterpart and SotA baselines. In the figure below, all pretraining and finetuning is done on the same scenes.
LIFT-GS exhibits robust scaling properties.
Our experiments demonstrate a clear "dataset multiplier" effect, where pretraining effectively amplifies the value of finetuning data, consistent with established scaling laws for transfer learning.
Importantly, these gains do not diminish even as the amount of finetuning data increases to 100%, indicating that 3D VLG models are currently operating in a severely data-constrained regime.
This suggests that using render-supervision along with foundation-scale image/video data offers a promising approach to scaling 3D vision-language models.
Moreover, our pipeline allows flexible use of 2D foundation models and pseudo-labeling strategies.
@article{liftgs2025,
author = {Cao, Ang and Arnaud, Sergio and Maksymets, Oleksandr and Yang, Jianing and Jain, Ayush and Yenamandra, Sriram and Martin, Ada and Berges, Vincent-Pierre and McVay, Paul and Partsey, Ruslan and Rajeswaran, Aravind and Meier, Franziska and Johnson, Justin and Park, Jeong Joon and Sax, Alexander},
title = {LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding},
year = {2025},
}