LingBot-Depth: Masked Depth Modeling for Spatial Perception


📄 Technical Report | 📄 arXiv | 🌐 Project Page | 💻 Code | 🤗 Hugging Face | 🤖 ModelScope | 🤖 Video

LingBot-Depth transforms incomplete and noisy depth sensor data into high-quality, metric-accurate 3D measurements. By jointly aligning RGB appearance and depth geometry in a unified latent space, our model serves as a powerful spatial perception foundation for robot learning and 3D vision applications.

Our approach refines raw sensor depth into clean, complete measurements, enabling:

  • Depth Completion & Refinement: Fills missing regions with metric accuracy and improved quality
  • Scene Reconstruction: High-fidelity indoor mapping with a strong depth prior
  • 4D Point Tracking: Accurate dynamic tracking in metric space for robot learning
  • Dexterous Manipulation: Robust grasping with precise geometric understanding

Artifacts Release

Model Zoo

We provide pretrained models for different scenarios:

Model             Hugging Face                                 ModelScope                                   Description
LingBot-Depth     robbyant/lingbot-depth-pretrain-vitl-14      robbyant/lingbot-depth-pretrain-vitl-14      General-purpose depth refinement
LingBot-Depth-DC  robbyant/lingbot-depth-postrain-dc-vitl14    robbyant/lingbot-depth-postrain-dc-vitl14    Optimized for sparse depth completion

Data Release (Coming Soon)

  • The curated 3M RGB-D dataset will be released upon completion of the necessary licensing and approval procedures.
  • Expected release: mid-March 2026.

Installation

Requirements

  • Python ≥ 3.9
  • PyTorch ≥ 2.0.0
  • CUDA-capable GPU (recommended)

From source

git clone https://github.com/robbyant/lingbot-depth
cd lingbot-depth

# Create and activate a dedicated environment
conda create -n lingbot-depth python=3.9
conda activate lingbot-depth

# Install the package (use 'python -m pip' to ensure the correct environment is used)
python -m pip install -e .

Quick Start

Inference:

import torch
import cv2
import numpy as np
from mdm.model.v2 import MDMModel

# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MDMModel.from_pretrained('robbyant/lingbot-depth-pretrain-vitl-14').to(device)

# Load and prepare inputs
image = cv2.cvtColor(cv2.imread('examples/0/rgb.png'), cv2.COLOR_BGR2RGB)
h, w = image.shape[:2]
image = torch.tensor(image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)[None]

depth = cv2.imread('examples/0/raw_depth.png', cv2.IMREAD_UNCHANGED).astype(np.float32) / 1000.0
depth = torch.tensor(depth, dtype=torch.float32, device=device)[None]

intrinsics = np.loadtxt('examples/0/intrinsics.txt')
intrinsics[0] /= w  # Normalize fx and cx by width
intrinsics[1] /= h  # Normalize fy and cy by height
intrinsics = torch.tensor(intrinsics, dtype=torch.float32, device=device)[None]

# Run inference
output = model.infer(
    image,
    depth_in=depth,
    intrinsics=intrinsics)

depth_pred = output['depth']   # Refined depth map
points = output['points']      # 3D point cloud
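
A quick way to sanity-check the refinement is to compare valid-pixel coverage before and after inference. The snippet below is a minimal sketch that reuses the tensors from the example above and treats zeros and NaNs as invalid, matching the convention in the Input Format section.

# Sanity check (illustrative): fraction of valid pixels before vs. after refinement.
def valid_fraction(d: torch.Tensor) -> float:
    valid = torch.isfinite(d) & (d > 0)  # zeros/NaNs count as invalid
    return valid.float().mean().item()

print(f"input coverage:   {valid_fraction(depth):.1%}")
print(f"refined coverage: {valid_fraction(depth_pred):.1%}")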

Run example:

The model is automatically downloaded from Hugging Face on first run (no manual download needed):

# Basic usage - processes example 0
python example.py

# Use a different example (0-7 available)
python example.py --example 1

# Use depth completion optimized model
python example.py --model robbyant/lingbot-depth-postrain-dc-vitl14

# Custom output directory
python example.py --output my_results

# See all options
python example.py --help

This processes the example data and saves results to result/ (or custom directory):

result/
├── rgb.png                 # Input RGB image
├── depth_input.npy         # Input depth (float32, meters)
├── depth_refined.npy       # Refined depth (float32, meters)
├── depth_input.png         # Input depth visualization
├── depth_refined.png       # Refined depth visualization
├── depth_comparison.png    # Side-by-side comparison
└── point_cloud.ply         # 3D point cloud

Eight example scenes (0-7) are included in the examples/ directory.
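
The saved arrays can be inspected directly. A minimal sketch, assuming numpy and, for the point cloud, the third-party open3d package (not a project dependency):

import numpy as np
import open3d as o3d  # assumption: installed separately, only needed for the .ply file

depth_in = np.load('result/depth_input.npy')     # float32, meters
depth_ref = np.load('result/depth_refined.npy')  # float32, meters
print(depth_in.shape, depth_ref.shape)

# Visualize the exported point cloud
pcd = o3d.io.read_point_cloud('result/point_cloud.ply')
o3d.visualization.draw_geometries([pcd])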

Method

We introduce a masked depth modeling approach that learns robust RGB-D representations through self-supervised learning. The model employs a Vision Transformer encoder with specialized depth-aware attention mechanisms to jointly process RGB and depth inputs.

Depth-aware attention visualization. Visualizing attention from depth queries (Q1–Q3, marked with ⋆) to RGB tokens in two scenes: (a) aquarium and (b) indoor shelf. Each row shows masked input depth, attention weights on RGB, and refined output. Different queries attend to spatially corresponding regions, demonstrating cross-modal alignment.

Key Innovations:

  • Masked Depth Modeling: Self-supervised pre-training via depth reconstruction
  • Cross-Modal Attention: Joint RGB-Depth alignment in unified latent space
  • Metric-Scale Preservation: Maintains real-world measurements for downstream tasks
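
To make the masked depth modeling objective described above concrete, here is a schematic sketch of the pre-training loss. It is illustrative only: the patch size, mask ratio, and model forward interface are assumptions, not the actual training code.

import torch
import torch.nn.functional as F

def masked_depth_loss(model, image, depth, mask_ratio=0.6, patch=14):
    """Illustrative masked depth modeling step (not the released training code).

    Random depth patches are hidden from the model, which must reconstruct
    them from RGB context plus the remaining visible depth.
    """
    B, H, W = depth.shape
    # Random per-patch mask: True = hidden from the model (assumes H, W divisible by patch)
    mask = torch.rand(B, H // patch, W // patch, device=depth.device) < mask_ratio
    mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)

    visible_depth = depth.masked_fill(mask, 0.0)         # hide masked regions
    pred = model(image, visible_depth)                   # assumed forward interface
    valid = mask & torch.isfinite(depth) & (depth > 0)   # supervise only masked, valid pixels
    return F.l1_loss(pred[valid], depth[valid])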

Training Data

Our model is trained on a large-scale diverse dataset combining real-world and simulated RGB-D captures:

Training dataset. 2M real-world and 1M simulated samples spanning diverse indoor environments (top). Representative RGB-D inputs with ground truth depth (bottom).

Dataset Composition:

  • Real Captures: 2M samples from residential, office, and commercial environments
  • Simulated Data: 1M photo-realistic renders with perfect ground truth
  • Modalities: RGB images, raw depth, refined ground truth depth
  • Diversity: Multiple sensor types, lighting conditions, and scene complexities

Applications

4D Point Tracking

LingBot-Depth provides metric-accurate 3D geometry essential for tracking dynamic targets:

4D point tracking. Robust tracking in gym environments with dynamic human motion. Top: query point selection. Middle: 3D tracking on deforming geometry. Bottom: refined depth maps. Demonstrated on scooter, rowing machine, gym equipment, and pull-up bar.

Dexterous Manipulation

High-quality geometric understanding enables reliable robotic grasping across diverse objects and materials:

Dexterous grasping. Robust manipulation enabled by refined depth. Top: point cloud reconstruction. Bottom: successful grasps on steel cup, glass cup, storage box, and toy car.

Hardware Setup

We developed a scalable RGB-D capture system for large-scale data collection:

RGB-D capture system. Multi-sensor setup with Intel RealSense, Orbbec Gemini, and Azure Kinect for scalable real-world data collection.

Model Details

Architecture

  • Encoder: Vision Transformer (Large) with RGB-D fusion
  • Decoder: Multi-scale feature pyramid with specialized heads
  • Heads: Depth regression
  • Training: Masked depth modeling with reconstruction objective

Input Format

RGB Image:

  • Shape: [B, 3, H, W] normalized to [0, 1]
  • Format: PyTorch tensor, float32

Depth Map:

  • Shape: [B, H, W]
  • Unit: Meters (configurable via scale parameter)
  • Invalid regions: 0 or NaN

Camera Intrinsics:

  • Shape: [B, 3, 3]
  • Normalized format: fx'=fx/W, fy'=fy/H, cx'=cx/W, cy'=cy/H
  • Example:
    [[fx/W,   0,   cx/W],
     [  0,  fy/H,  cy/H],
     [  0,    0,    1  ]]
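
For reference, the normalization above can be applied to a pixel-space intrinsics matrix as follows. This is a minimal sketch; the numeric values and variable names are illustrative.

import numpy as np
import torch

# Pixel-space intrinsics (illustrative values)
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]], dtype=np.float32)
H, W = 480, 640

K_norm = K.copy()
K_norm[0] /= W  # fx, cx -> fx/W, cx/W
K_norm[1] /= H  # fy, cy -> fy/H, cy/H
intrinsics = torch.from_numpy(K_norm)[None]  # [B, 3, 3]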
    

Output Format

The model returns a dictionary:

{
    'depth': torch.Tensor,   # Refined depth [B, H, W]
    'points': torch.Tensor,  # Point cloud [B, H, W, 3] in camera space
}
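
For reference, camera-space points can be recovered from a depth map and the normalized intrinsics with a standard pinhole unprojection. The sketch below is illustrative and is not necessarily how output['points'] is computed internally.

import torch

def unproject(depth: torch.Tensor, intrinsics: torch.Tensor) -> torch.Tensor:
    """Pinhole unprojection sketch: depth [B, H, W] + normalized K [B, 3, 3] -> points [B, H, W, 3]."""
    B, H, W = depth.shape
    # Recover pixel-space focal lengths and principal point from the normalized matrix
    fx = intrinsics[:, 0, 0] * W
    fy = intrinsics[:, 1, 1] * H
    cx = intrinsics[:, 0, 2] * W
    cy = intrinsics[:, 1, 2] * H
    v, u = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=depth.dtype),
        torch.arange(W, device=depth.device, dtype=depth.dtype),
        indexing="ij",
    )
    x = (u[None] - cx.view(B, 1, 1)) / fx.view(B, 1, 1) * depth
    y = (v[None] - cy.view(B, 1, 1)) / fy.view(B, 1, 1) * depth
    return torch.stack([x, y, depth], dim=-1)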

Inference Parameters

model.infer(
    image,                                   # RGB tensor [B, 3, H, W]
    depth_in=None,                           # Input depth [B, H, W]
    use_fp16=True,                           # Mixed precision inference
    intrinsics=None,                         # Camera intrinsics [B, 3, 3]
)
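
For example, mixed precision can be tied to GPU availability when running on mixed hardware (a usage sketch using the parameters listed above):

# Disable fp16 automatically on CPU-only machines
output = model.infer(
    image,
    depth_in=depth,
    intrinsics=intrinsics,
    use_fp16=torch.cuda.is_available(),
)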

Citation

If you find this work useful for your research, please cite:

@article{lingbot-depth2026,
  title={Masked Depth Modeling for Spatial Perception},
  author={Tan, Bin and Sun, Changjiang and Qin, Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},
  journal={arXiv preprint arXiv:2601.17895},
  year={2026}
}

Please also consider citing DINOv2, which serves as our backbone:

@article{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal={Transactions on Machine Learning Research},
  year={2024}
}

License

This project is released under the Apache License 2.0. See LICENSE file for details.

Acknowledgments

This work builds upon several excellent open-source projects:

  • DINOv2 - Self-supervised vision transformer backbone
  • Masked Autoencoders - Self-supervised learning framework
  • The broader open-source computer vision and robotics communities

Contact

For questions, discussions, or collaborations, please open an issue on the repository.
