Open-World Segmentation and Tracking in 3D: Laura Leal-Taixé - GTC - March 18-21, 2024
Dynamic Scene Understanding
The task

[Figure: the task ladder, from semantic segmentation to panoptic segmentation, multi-object tracking & segmentation, and 4D panoptic segmentation, plotted against a "# classes" axis running from a handful to thousands to open-world.]

REAL-WORLD
[Figure: in practice, "Let's get some data!" turns into annotating 100 similar frames ("sure, some more data"), motivating PSEUDO-LABELS instead of exhaustive manual annotation.]
Better Call SAL:
Towards Learning to Segment
Anything in Lidar
A. Osep et al., "Better Call SAL: Towards Learning to Segment Anything in Lidar." arXiv:2403.13129
Hello SAL
Prompt: "segment trash cans and fire hydrants."
Better Call SAL
(Towards) a Lidar foundation model

Inference:
Lidar Point Cloud → SAL Model → segment and classify any object
(1. car, 2. person, 3. road, …, C. traffic sign)

Training:
Segment and classify any object by manually annotating large-scale data in 3D (as in 2D)? Or benefit from existing 2D annotations/models!
Unlabeled Camera and Lidar Data → SAL Pseudo-Label Engine, built on Vision Foundation Models (CLIP, SAM)
SAL Training With Pseudo-Labels

Unlabeled Camera and Lidar Data → SAL Pseudo-Label Engine (CLIP, SAM) → Pseudo-Labels → SAL Model training (Distillation)
The SAL Model
segments and classifies

Lidar Point Cloud → Backbone → Instance Decoder → Mask Instances + Objectness
Class-agnostic segmentation: object queries decode per-instance masks with an objectness score.
Zero-shot classification: each object query also predicts a CLIP token, matched via CLIP against text embeddings of class names (1. car, 2. person, 3. road, …, C. traffic sign).
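The zero-shot classification step can be sketched as follows; a minimal numpy sketch, assuming the decoder already emits one CLIP-space token per predicted instance. The embeddings below are random placeholders, not real CLIP outputs.

```python
import numpy as np

def zero_shot_classify(instance_tokens, text_embeddings):
    """Assign each predicted instance the class whose CLIP text
    embedding is most similar to the instance's predicted CLIP token."""
    # L2-normalize so the dot product is cosine similarity
    t = instance_tokens / np.linalg.norm(instance_tokens, axis=1, keepdims=True)
    c = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sim = t @ c.T                      # (num_instances, num_classes)
    return sim.argmax(axis=1)          # class index per instance

# Toy example: 2 instances, 3 classes ("car", "person", "road");
# tokens are placed near a class embedding plus a little noise.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((3, 512))
tokens = np.stack([text_emb[1] + 0.1 * rng.standard_normal(512),   # near "person"
                   text_emb[0] + 0.1 * rng.standard_normal(512)])  # near "car"
print(zero_shot_classify(tokens, text_emb))  # → [1 0]
```

In the real model the text embeddings would come from a CLIP text encoder over the class vocabulary, which is what makes the vocabulary swappable at inference time.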
How do we train
such a segment anything model?
The Pseudo-Label Engine
Label transfer from 2D to 3D
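The idea of transferring 2D labels to 3D can be sketched as follows, under simplified assumptions: a pinhole camera, lidar points already in camera coordinates, and a single image. The intrinsics and mask below are hypothetical toy values, not from the paper.

```python
import numpy as np

def transfer_labels_2d_to_3d(points, P, mask_2d):
    """Transfer per-pixel 2D instance ids to lidar points by projecting
    each 3D point into the image and reading the mask at that pixel.

    points:  (N, 3) lidar points in camera coordinates (z forward)
    P:       (3, 3) pinhole intrinsics
    mask_2d: (H, W) integer instance ids (e.g. from SAM), 0 = background
    Returns: (N,) instance id per point, 0 for points outside the image.
    """
    H, W = mask_2d.shape
    uvw = points @ P.T                       # project to homogeneous pixels
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    labels = np.zeros(len(points), dtype=int)
    valid = (points[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[valid] = mask_2d[v[valid].astype(int), u[valid].astype(int)]
    return labels

# Toy example: 4x4 image, instance 7 occupies the right half.
P = np.array([[2.0, 0.0, 2.0],   # fx, 0, cx  (hypothetical intrinsics)
              [0.0, 2.0, 2.0],   # 0, fy, cy
              [0.0, 0.0, 1.0]])
mask = np.zeros((4, 4), dtype=int)
mask[:, 2:] = 7
pts = np.array([[0.5, 0.0, 1.0],    # projects to u=3 → instance 7
                [-0.5, 0.0, 1.0]])  # projects to u=1 → background
print(transfer_labels_2d_to_3d(pts, P, mask))  # → [7 0]
```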
SAL in action
Zero-shot panoptic segmentation
SAL in action
Text prompting beyond class vocabularies
Easily scalable: get data → pseudo-label → re-train the model.
There are still some instances you are not going to recognize. Is there anything we can do?
Moving objects are the most critical ones, so why not find a solution for those?
SeMoLi: Motion-inspired Pseudo-Labeling
for 3D Object Detection in Lidar
Inputs: point trajectories.

Pseudo-label generation:
• Create a graph where nodes are point trajectories and edges connect trajectories that might belong to the same object.
• Message Passing: nodes and edges communicate and update their respective features.
• Classify edges into active or not + correlation clustering + extracting bounding boxes.
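At their simplest, the clustering and box-extraction steps above can be sketched as follows; a hedged sketch that assumes the edge decisions already come from the message-passing network, and that replaces correlation clustering with plain connected components over the active edges.

```python
import numpy as np

def cluster_and_box(traj_centers, edges, edge_active):
    """Group point trajectories into objects via the edges classified
    as 'active', then extract an axis-aligned box per group.

    traj_centers: (N, 3) one representative point per trajectory
    edges:        list of (i, j) candidate pairs from the graph
    edge_active:  list of bools, the edge classifier's decisions
    Returns: list of (min_corner, max_corner) boxes.
    """
    # Union-find over active edges = connected components
    parent = list(range(len(traj_centers)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), active in zip(edges, edge_active):
        if active:
            parent[find(i)] = find(j)
    groups = {}
    for n in range(len(traj_centers)):
        groups.setdefault(find(n), []).append(n)
    boxes = []
    for members in groups.values():
        pts = traj_centers[members]
        boxes.append((pts.min(axis=0), pts.max(axis=0)))
    return boxes

# Toy example: trajectories 0 and 1 joined by an active edge (one object),
# trajectory 2 kept separate by an inactive edge.
centers = np.array([[0., 0., 0.], [1., 1., 0.], [10., 0., 0.]])
boxes = cluster_and_box(centers, [(0, 1), (1, 2)], [True, False])
print(len(boxes))  # → 2
```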
Evaluation of SeMoLi
Object Detection and Cross-Dataset Generalization
Baseline (DBSCAN++): Najibi et al., "Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving." ECCV 2022.
Take home messages
Our goal is to open up possibilities in 3D without requiring labeled data (in 3D)
Questions?