
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency


arXiv · Hugging Face Paper · TPAMI

¹State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
²Nanyang Technological University, Singapore  ³University of California, Merced

This repository provides the official IEEE TPAMI implementation of DC-SAM. We propose a novel dual-consistency framework to enable Segment Anything Models (SAM) to perform in-context segmentation across both images and videos. By enforcing spatial and temporal consistency, DC-SAM achieves superior generalization for zero-shot and interactive segmentation tasks without requiring extensive per-scene fine-tuning.

Table of Contents

  • News
  • Highlights
  • Benchmark
  • Results
  • Getting Started
  • Preparing Datasets
  • Preparing Pretrained Weights
  • Training & Evaluation
  • Citation
  • License

News

[2025/12/17] Our paper has been accepted by IEEE TPAMI!

[2025/03/17] The validation set for IC-VOS is now released. Enjoy exploring and working with it!

[2025/02/08] The code and datasets for image-to-image / image-to-video in-context learning are released. Enjoy it :)

Highlights

  • Dual Consistency SAM (DC-SAM) for one-shot segmentation: Fully explores positive and negative features of visual prompts, generating high-quality prompts tailored for one-shot segmentation tasks.
  • Query cyclic-consistent cross-attention mechanism: Ensures precise focus on key areas, effectively filtering out confusing components and improving accuracy and specificity in one-shot segmentation (a toy sketch follows this list).
  • New video in-context segmentation benchmark (IC-VOS): Introduces a manually collected benchmark from existing video datasets, providing a robust platform for evaluating state-of-the-art methods.
  • Extension to SAM2 with mask tube design: Enhances prompt generation for video object segmentation, achieving strong performance on the proposed IC-VOS benchmark.
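
To make the cyclic-consistency idea concrete, below is a minimal, hypothetical sketch of a cycle-consistent cross-attention step: prompt queries attend to image features, the features attend back to the updated queries, and the round-trip attention map is encouraged to stay near the identity so that each query keeps focusing on its own region. All shapes, names, and the identity-based penalty are illustrative assumptions, not the released DC-SAM code.

import torch
import torch.nn.functional as F

def cyclic_consistent_cross_attention(queries, feats):
    # queries: (B, Nq, C) prompt queries; feats: (B, HW, C) flattened image features.
    scale = queries.shape[-1] ** -0.5
    # Forward direction: each query attends over the image features.
    attn_q2f = torch.softmax(queries @ feats.transpose(1, 2) * scale, dim=-1)  # (B, Nq, HW)
    updated = attn_q2f @ feats                                                 # (B, Nq, C)
    # Reverse direction: the features attend back to the updated queries.
    attn_f2q = torch.softmax(feats @ updated.transpose(1, 2) * scale, dim=-1)  # (B, HW, Nq)
    # Round trip (query -> features -> query): pushing this toward the
    # identity discourages distinct queries from collapsing onto one region.
    cycle = attn_q2f @ attn_f2q                                                # (B, Nq, Nq)
    eye = torch.eye(cycle.shape[-1], device=cycle.device).expand_as(cycle)
    return updated, F.mse_loss(cycle, eye)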

Benchmark

We establish a rigorous benchmark for In-Context Video Object Segmentation (IC-VOS) by adapting several classic datasets into the in-context paradigm. In this setting, the model must segment the target object in a query video based on a provided reference frame + mask pair.
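
Concretely, a single evaluation episode pairs one annotated reference with an unannotated query video. The structure below is a hypothetical illustration of the protocol; the field names and paths are assumptions, not the dataset's actual schema.

# Hypothetical episode layout (field names and paths are illustrative only).
episode = {
    "reference_image": "reference/frame.jpg",                 # in-context example frame
    "reference_mask": "reference/frame.png",                  # mask of the target object
    "query_frames": ["query/00000.jpg", "query/00001.jpg"],   # video frames to segment
}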


Results

DC-SAM significantly outperforms existing in-context learners and SAM-based variants, owing to the consistent, high-quality prompts produced by its dual-consistency design.

Getting Started

Step 1: clone this repository:

git clone https://github.com/zaplm/DC-SAM.git && cd DC-SAM

Step 2: create a conda environment and install the dependencies:

conda create -n DCSAM python=3.10 -y
conda activate DCSAM
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
cd segment-anything-2 && pip install -e .
cd ../segment-anything && pip install -e .
cd .. && pip install -r requirements.txt
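
Optionally, a quick sanity check that the environment is usable (this assumes the installs above completed without errors):

import torch
import segment_anything  # installed from ./segment-anything
import sam2              # installed from ./segment-anything-2
print(torch.__version__)          # expect 2.4.0
print(torch.cuda.is_available())  # expect True on a CUDA 12.1 machine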

Preparing Datasets

Download the following datasets:

1. PASCAL-5i

Download PASCAL VOC2012 devkit (train/val data):

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar

Download PASCAL VOC2012 SDS extended mask annotations from [Google Drive].

2. COCO-20i

Download COCO2014 train/val images and annotations:

wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip

Download the COCO2014 train/val mask annotations: train2014.zip [Google Drive] and [val2014.zip], then place both train2014/ and val2014/ under the annotations/ directory.

3. IC-VOS

Download our proposed IC-VOS validation set from [Baidu Netdisk]/[HuggingFace dataset]. (COCO2014 train/val annotations and PASCAL VOC2012 SDS extended mask annotations can also be accessed via our [HuggingFace dataset])

Create a dataset root directory ../Datasets for the above few-shot segmentation datasets and place each dataset to match the following directory structure:

Datasets/
  ├── VOC2012/            # PASCAL VOC2012 devkit
  │   ├── Annotations/
  │   ├── ImageSets/
  │   ├── ...
  │   └── SegmentationClassAug/
  ├── COCO2014/           
  │   ├── annotations/
  │   │   ├── train2014/  # (dir.) training masks (from Google Drive) 
  │   │   ├── val2014/    # (dir.) validation masks (from Google Drive)
  │   │   └── ..some json files..
  │   ├── train2014/
  │   └── val2014/
  └── IC-VOS/
      ├── 0a43a414/
      │   ├── Annotations/
      │   └── JPEGImages/
      └── ...

Preparing Pretrained Weights

ResNet-50/101_v2 weights can be downloaded from Google Drive.

VGG-16_bn weights can be downloaded from here.

The SAM checkpoint can be downloaded from this repository.

The SAM2 checkpoint can be downloaded from this repository.
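
For reference, the checkpoints load through the upstream SAM/SAM2 APIs. The filenames and config name below are the standard upstream release names, used here as placeholder assumptions; substitute the variants you actually downloaded:

from segment_anything import sam_model_registry
from sam2.build_sam import build_sam2

# SAM1: the registry key selects the backbone (vit_h / vit_l / vit_b).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# SAM2: the config is resolved from the sam2 package; uses a CUDA device by default.
sam2_model = build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt")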

Training & Evaluation

Training

It is recommended to train DC-SAM using 4 GPUs. In the commands below, slash-separated values (e.g., --epochs 50/100, --benchmark coco/pascal) list the available alternatives; pick one per run.

Train on image:

python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_image.py --epochs 50/100 --benchmark coco/pascal --lr 1e-4/2e-4 --bsz 2 --nshot 1 \
   --num_query 25 --sam_version 1/2 --nworker 8 --fold 0/1/2/3 --backbone resnet50/vgg16 --logpath log_name

Train on video (image-to-video pretraining or fine-tuning):

python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=6224 train_video.py --epochs 40/10 --benchmark coco_all/coco_mask_tube --lr 1e-4/1e-5 --bsz 8/1 --nshot 1 \
   --num_query 25 --data_type image/video --sam_version 2 --nworker 8 --backbone resnet50 --logpath log_name

Validation

To evaluate on the IC-VOS benchmark:

python eval_video.py --coco_path /path/to/coco --icvos_path /path/to/icvos --ckpt /path/to/ckpt

To evaluate on few-shot segmentation benchmarks:

python eval_image.py --datapath /path/to/benchmark --benchmark pascal/coco --fold 0/1/2/3 --ckpt /path/to/ckpt

Citation

@article{qi2025dc,
  title={DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency},
  author={Qi, Mengshi and Zhu, Pengfei and Ma, Huadong and Qi, Lu and Li, Xiangtai and Yang, Ming-Hsuan},
  journal={arXiv preprint arXiv:2504.12080},
  year={2025}
}

License

This repository is licensed under Apache 2.0.
