
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency


arXiv · Hugging Face Paper · TPAMI

¹State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
²Nanyang Technological University, Singapore  ³University of California, Merced

This repository provides the official IEEE TPAMI implementation of DC-SAM. We propose a novel dual-consistency framework to enable Segment Anything Models (SAM) to perform in-context segmentation across both images and videos. By enforcing spatial and temporal consistency, DC-SAM achieves superior generalization for zero-shot and interactive segmentation tasks without requiring extensive per-scene fine-tuning.

Table of Contents

  • News
  • Highlights
  • Benchmark
  • Results
  • Getting Started
  • Preparing Datasets
  • Preparing Pretrained Weights
  • Training & Evaluation
  • Citation
  • License

News

[2025/12/17] Our paper has been accepted by IEEE TPAMI!

[2025/03/17] The validation set for IC-VOS is now released. Enjoy exploring and working with it!

[2025/02/08] The code and datasets for image-to-image / image-to-video in-context learning are released. Enjoy it :)

Highlights

  • Dual Consistency SAM (DC-SAM) for one-shot segmentation: Fully explores positive and negative features of visual prompts, generating high-quality prompts tailored for one-shot segmentation tasks.
  • Query cyclic-consistent cross-attention mechanism: Ensures precise focus on key areas, effectively filtering out confusing components and improving accuracy and specificity in one-shot segmentation (a toy sketch follows this list).
  • New video in-context segmentation benchmark (IC-VOS): Introduces a manually collected benchmark from existing video datasets, providing a robust platform for evaluating state-of-the-art methods.
  • Extension to SAM2 with mask tube design: Enhances prompt generation for video object segmentation, achieving strong performance on the proposed IC-VOS benchmark.
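
To make the cyclic-consistency idea concrete, below is a minimal, hypothetical sketch of a cycle-consistent cross-attention step: prompt queries attend to image features, the features attend back to the updated queries, and the round-trip attention map is encouraged to stay near the identity so that each query keeps focusing on its own region. All shapes, names, and the identity-based penalty are illustrative assumptions, not the released DC-SAM code.

import torch
import torch.nn.functional as F

def cyclic_consistent_cross_attention(queries, feats):
    # queries: (B, Nq, C) prompt queries; feats: (B, HW, C) flattened image features.
    scale = queries.shape[-1] ** -0.5
    # Forward direction: each query attends over the image features.
    attn_q2f = torch.softmax(queries @ feats.transpose(1, 2) * scale, dim=-1)  # (B, Nq, HW)
    updated = attn_q2f @ feats                                                 # (B, Nq, C)
    # Reverse direction: the features attend back to the updated queries.
    attn_f2q = torch.softmax(feats @ updated.transpose(1, 2) * scale, dim=-1)  # (B, HW, Nq)
    # Round trip (query -> features -> query): pushing this toward the
    # identity discourages distinct queries from collapsing onto one region.
    cycle = attn_q2f @ attn_f2q                                                # (B, Nq, Nq)
    eye = torch.eye(cycle.shape[-1], device=cycle.device).expand_as(cycle)
    return updated, F.mse_loss(cycle, eye)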

Benchmark

We establish a rigorous benchmark for In-Context Video Object Segmentation (IC-VOS) by adapting several classic datasets into the in-context paradigm. In this setting, the model must segment the target object in a query video based on a provided reference frame + mask pair.
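
Concretely, a single evaluation episode pairs one annotated reference with an unannotated query video. The structure below is a hypothetical illustration of the protocol; the field names and paths are assumptions, not the dataset's actual schema.

# Hypothetical episode layout (field names and paths are illustrative only).
episode = {
    "reference_image": "reference/frame.jpg",                 # in-context example frame
    "reference_mask": "reference/frame.png",                  # mask of the target object
    "query_frames": ["query/00000.jpg", "query/00001.jpg"],   # video frames to segment
}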


Results

DC-SAM significantly outperforms existing in-context learners and SAM-based variants, owing to the consistent, high-quality prompts produced by its dual-consistency design.

Getting Started

Step 1: clone this repository:

git clone https://github.com/zaplm/DC-SAM.git && cd DC-SAM

Step 2: create a conda environment and install the dependencies:

conda create -n DCSAM python=3.10 -y
conda activate DCSAM
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
cd segment-anything-2 && pip install -e .
cd ../segment-anything && pip install -e .
cd .. && pip install -r requirements.txt
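
Optionally, a quick sanity check that the environment is usable (this assumes the installs above completed without errors):

import torch
import segment_anything  # installed from ./segment-anything
import sam2              # installed from ./segment-anything-2
print(torch.__version__)          # expect 2.4.0
print(torch.cuda.is_available())  # expect True on a CUDA 12.1 machine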

Preparing Datasets

Download the following datasets:

1. PASCAL-5i

Download PASCAL VOC2012 devkit (train/val data):

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar

Download PASCAL VOC2012 SDS extended mask annotations from [Google Drive].

2. COCO-20i

Download COCO2014 train/val images and annotations:

wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip

Download the COCO2014 train/val mask annotations: train2014.zip [Google Drive] and [val2014.zip], then place both train2014/ and val2014/ under the annotations/ directory.

3. IC-VOS

Download our proposed IC-VOS validation set from [Baidu Netdisk]/[HuggingFace dataset]. (COCO2014 train/val annotations and PASCAL VOC2012 SDS extended mask annotations can also be accessed via our [HuggingFace dataset])

Create a dataset root directory ../Datasets for the above few-shot segmentation datasets and place each dataset to match the following directory structure:

Datasets/
  ├── VOC2012/            # PASCAL VOC2012 devkit
  │   ├── Annotations/
  │   ├── ImageSets/
  │   ├── ...
  │   └── SegmentationClassAug/
  ├── COCO2014/           
  │   ├── annotations/
  │   │   ├── train2014/  # (dir.) training masks (from Google Drive) 
  │   │   ├── val2014/    # (dir.) validation masks (from Google Drive)
  │   │   └── ..some json files..
  │   ├── train2014/
  │   └── val2014/
  └── IC-VOS/
      ├── 0a43a414/
      │   ├── Annotations/
      │   └── JPEGImages/
      └── ...

Preparing Pretrained Weights

ResNet-50/101_v2 weights can be downloaded from Google Drive.

VGG-16_bn weights can be downloaded from here.

The SAM checkpoint can be downloaded from this repository.

The SAM2 checkpoint can be downloaded from this repository.
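
For reference, the checkpoints load through the upstream SAM/SAM2 APIs. The filenames and config name below are the standard upstream release names, used here as placeholder assumptions; substitute the variants you actually downloaded:

from segment_anything import sam_model_registry
from sam2.build_sam import build_sam2

# SAM1: the registry key selects the backbone (vit_h / vit_l / vit_b).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# SAM2: the config is resolved from the sam2 package; uses a CUDA device by default.
sam2_model = build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt")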

Training & Evaluation

Training

It is recommended to train DC-SAM using 4 GPUs. In the commands below, slash-separated values (e.g., --epochs 50/100, --benchmark coco/pascal) list the available alternatives; pick one per run.

Train on image:

python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_image.py --epochs 50/100 --benchmark coco/pascal --lr 1e-4/2e-4 --bsz 2 --nshot 1 \
   --num_query 25 --sam_version 1/2 --nworker 8 --fold 0/1/2/3 --backbone resnet50/vgg16 --logpath log_name

Train on video (image-to-video pretraining or fine-tuning):

python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_video.py --epochs 40/10 --benchmark coco_all/coco_mask_tube --lr 1e-4/1e-5 --bsz 8/1 --nshot 1 \
   --num_query 25 --data_type image/video --sam_version 2 --nworker 8 --backbone resnet50 --logpath log_name

Validation

To evaluate on the IC-VOS benchmark:

python eval_video.py --coco_path /path/to/coco --icvos_path /path/to/icvos --ckpt /path/to/ckpt

To evaluate on few-shot segmentation benchmarks:

python eval_image.py --datapath /path/to/benchmark --benchmark pascal/coco --fold 0/1/2/3 --ckpt /path/to/ckpt

Citation

@article{qi2025dc,
  title={DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency},
  author={Qi, Mengshi and Zhu, Pengfei and Ma, Huadong and Qi, Lu and Li, Xiangtai and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2504.12080},
  year={2025}
}

License

This repository is licensed under Apache 2.0.
