Summary: The talk discusses advancements in dynamic scene understanding for embodied autonomous agents, covering semantic segmentation, panoptic segmentation, and multi-object tracking. It introduces the SAL model for Lidar segmentation, which uses pseudo-labeling and text prompting to segment and classify objects in 3D, and the SeMoLi method for motion-inspired pseudo-labeling to improve 3D object detection using minimal labeled data.

Open-world Segmentation and Tracking in 3D

Laura Leal-Taixé | GTC | March 18-21, 2024


Towards Embodied Autonomous Agents

What is around me? How do they move? Where am I?



Dynamic Scene Understanding
The task: understand every pixel of a video.

The tasks can be arranged by the number of classes they cover, from a handful to thousands to the open world:

• Semantic segmentation: assign a class label (tree, car, person, road, …) to every pixel.
• Panoptic segmentation: additionally separate individual instances (person 1, person 2, person 3).
• Multi-object tracking & segmentation: associate instances over time.
• 4D panoptic segmentation: unify segmentation and tracking in space and time.

REAL-WORLD: covering more classes and more frames means annotating ever more data ("Annotating 100 similar frames… sure, some more data", "Let's get some data!").

PSEUDO-LABELS: instead of annotating everything manually, generate the labels automatically.
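To make the distinctions between these tasks concrete, here is a minimal illustrative sketch (not from the talk; the array shapes and class ids are assumptions) of how per-point labels could look for each task:

```python
import numpy as np

# Illustrative per-point label layouts for one Lidar/image sequence
# (hypothetical shapes and class ids, not any dataset's actual format).
N, T = 100_000, 10  # points per frame, number of frames

# Semantic segmentation: one class id per point (e.g. 0=road, 1=car, 2=person).
semantic = np.zeros(N, dtype=np.int32)

# Panoptic segmentation: class id plus instance id per point
# ("stuff" classes such as road keep instance id 0).
panoptic = np.zeros((N, 2), dtype=np.int32)

# Multi-object tracking & segmentation / 4D panoptic segmentation:
# the instance (track) id must stay consistent across all T frames.
panoptic_4d = [np.zeros((N, 2), dtype=np.int32) for _ in range(T)]
```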
Better Call SAL: Towards Learning to Segment Anything in Lidar

A. Osep et al. "Better Call SAL: Towards Learning to Segment Anything in Lidar." arXiv:2403.13129

Hello SAL, segment trash cans and fire hydrants.
Better Call SAL
(Towards) a Lidar foundation model

Inference: the SAL model takes a Lidar point cloud and text prompts, and predicts instances together with their semantics (1. car, 2. person, 3. road, …, C. traffic sign).

Training: Segment Anything in Lidar via Vision Foundation Model to Lidar distillation. Unlabeled camera and Lidar data are fed to the SAL pseudo-label engine, built on CLIP and SAM, which generates the labels used to train the SAL model.

Do we segment and classify any object by manually annotating large-scale data in 3D (as was done in 2D)? Or do we benefit from existing 2D annotations and models? We benefit from existing 2D annotations and models!


Overview of SAL
Training: pseudo-labeling and model

Segment Anything in Lidar: Vision Foundation Model to Lidar distillation. Unlabeled camera and Lidar data → SAL pseudo-label engine (CLIP + SAM) → SAL training with pseudo-labels (distillation) → SAL model.

• Pseudo-label engine: transfer from 2D foundation models to 3D labels
• SAL model: zero-shot segmentation via text prompting
The SAL model
segments and classifies.

Inputs: a Lidar point cloud and text prompts. Outputs: instances (1. car, 2. person, 3. road, …, C. traffic sign) and their semantics.

Architecture: a backbone feeds an instance decoder driven by object queries. For each query, the decoder predicts an objectness score and an instance mask (class-agnostic segmentation) as well as a CLIP token; matching the CLIP token against CLIP embeddings of the text prompts yields zero-shot classification.
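As a rough illustration of the zero-shot classification step, here is a minimal sketch assuming per-instance CLIP tokens and prompt embeddings are already available; the tensor shapes and helper name are illustrative, not SAL's actual API:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(instance_clip_tokens, prompt_embeddings):
    """Match each predicted instance to its most similar text prompt.

    instance_clip_tokens: (num_instances, d) CLIP-space tokens from the decoder.
    prompt_embeddings: (num_prompts, d) CLIP text embeddings of prompts
        such as "car", "person", "road", "traffic sign".
    """
    inst = F.normalize(instance_clip_tokens, dim=-1)
    txt = F.normalize(prompt_embeddings, dim=-1)
    similarity = inst @ txt.T          # cosine similarities, (num_instances, num_prompts)
    return similarity.argmax(dim=-1)   # index of the best-matching prompt per instance

# Placeholder example; real inputs would come from the decoder and a CLIP text encoder.
labels = zero_shot_classify(torch.randn(5, 512), torch.randn(4, 512))
```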
How do we train such a segment anything model?
The Pseudo-Label Engine
Label transfer from 2D to 3D
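The general recipe for transferring 2D masks onto Lidar points might look like this minimal sketch, assuming a calibrated pinhole camera and a 2D instance-id image (e.g. from SAM); the variable names and pipeline details are assumptions, not the paper's implementation:

```python
import numpy as np

def transfer_masks_to_points(points_xyz, K, T_cam_from_lidar, masks_2d):
    """Give each Lidar point the id of the 2D instance mask it projects into.

    points_xyz: (N, 3) Lidar points.
    K: (3, 3) camera intrinsics; T_cam_from_lidar: (4, 4) extrinsics.
    masks_2d: (H, W) integer image of instance ids, 0 = background.
    """
    H, W = masks_2d.shape
    pts_h = np.c_[points_xyz, np.ones(len(points_xyz))]   # homogeneous coordinates
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]           # points in the camera frame
    in_front = cam[:, 2] > 0
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                           # perspective division
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    labels = np.zeros(len(points_xyz), dtype=np.int32)    # 0 = unlabeled
    labels[valid] = masks_2d[v[valid], u[valid]]
    return labels
```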
SAL in action
Zero-shot panoptic segmentation
SAL in action
Text prompting beyond class vocabularies

Hello SAL, segment streetcar.


SAL in action
Text prompting beyond class vocabularies

Hello SAL, segment store front.


SAL in action
Text prompting beyond class vocabularies

Hello SAL, segment curb.


Better Call SAL

Easily scalable: get data --> pseudo-label --> re-train the model.

There are still some instances the model will not recognize. Is there anything we can do?

Moving objects are the most critical ones… so why not find a solution for those?
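The scalability loop above could be expressed roughly as follows; `collect_unlabeled_scans`, `pseudo_label_engine`, and `train` are hypothetical placeholders, not the released code:

```python
def self_training_round(model, collect_unlabeled_scans, pseudo_label_engine, train):
    """One round of the scale-up loop: get data -> pseudo-label -> re-train."""
    scans = collect_unlabeled_scans()                  # new unlabeled camera + Lidar data
    pseudo_labels = [pseudo_label_engine(s) for s in scans]
    return train(model, scans, pseudo_labels)          # re-train on the new pseudo-labels
```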
SeMoLi: Motion-inspired Pseudo-Labeling for 3D Object Detection in Lidar

J. Seidenschwarz et al. "SeMoLi: What Moves Together Belongs Together." arXiv:2402.19463


Motion-inspired Pseudo-Labeling for 3D Object Detection in Lidar
Problem Statement

Inputs: labeled Lidar streams ("just a little bit") and unlabeled Lidar streams (lots).

Output: an object detector (for objects observed moving).

Our Method: SeMoLi
Segment Moving in Lidar for Pseudo-Labeling

Two stages: pseudo-label generation, then object detector training.

Segmenting moving objects with GNNs

• Create a graph where nodes are point trajectories and edges connect trajectories that might belong to the same object.
• Message passing: nodes and edges communicate and update their respective features.
• Classify edges as active or not, then apply correlation clustering and extract bounding boxes.
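A compact sketch of the graph-network idea, with simplified message passing and an edge classification head; the feature dimensions and update rules are illustrative assumptions rather than SeMoLi's exact architecture:

```python
import torch
import torch.nn as nn

class EdgeClassifier(nn.Module):
    """Toy graph network: nodes are point trajectories, edges are candidate
    "belongs to the same object" links, classified after message passing."""

    def __init__(self, node_dim=64, edge_dim=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim + edge_dim, edge_dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(node_dim + edge_dim, node_dim), nn.ReLU())
        self.edge_head = nn.Linear(edge_dim, 1)  # score: edge active or not

    def forward(self, node_feats, edge_index, edge_feats, num_rounds=3):
        src, dst = edge_index  # (E,) source / target trajectory indices per edge
        for _ in range(num_rounds):
            # Edge update: combine both endpoint trajectories with the edge feature.
            edge_feats = self.edge_mlp(
                torch.cat([node_feats[src], node_feats[dst], edge_feats], dim=-1))
            # Node update: sum incoming edge messages, then mix with the node feature.
            agg = torch.zeros(node_feats.shape[0], edge_feats.shape[-1])
            agg.index_add_(0, dst, edge_feats)
            node_feats = self.node_mlp(torch.cat([node_feats, agg], dim=-1))
        # Edges scored positive would be kept; correlation clustering over the kept
        # edges groups trajectories into objects, and 3D boxes are fit per group.
        return self.edge_head(edge_feats).squeeze(-1)

# Example with random placeholder features: 10 trajectories, 30 candidate edges.
scores = EdgeClassifier()(torch.randn(10, 64), torch.randint(0, 10, (2, 30)), torch.randn(30, 64))
```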
Evaluation of SeMoLi
Object Detection and Cross-Dataset Generalization

Waymo validation set: [results showing the remaining gap between ground-truth labels and our pseudo-labels, and improvements over prior art]

Waymo -> Argoverse2 transfer: [cross-dataset generalization results]

Baseline (DBSCAN++): Najibi et al. "Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving." ECCV 2022.
Take-home messages

Pseudo-labeling is a powerful tool for leveraging strong 2D foundation models for 3D tasks.

Geometric and 3D motion cues are still waiting to be fully explored!

Our goal is to open up possibilities in 3D without requiring labeled data (in 3D).
Questions?

Laura Leal-Taixé | GTC | March 18-21, 2024
