Evangelos Kazakos1, Cordelia Schmid2, Josef Sivic1
1Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague
2Inria, École normale supérieure, CNRS, PSL Research University
arXiv | Project Website
- 18/01/2027: We updated the dataset download process. You can now download the videos of our HowToGround1M dataset as well as those of the original HowTo100M, in addition to iGround. See the updated instructions in the README.
- 09/11/2025: The HowToGround1M and iGround datasets are now available on Hugging Face: HowToGround1M | iGround. They can be loaded directly with load_dataset() from the Hugging Face Datasets library (see the sketch after this list).
- 02/09/2025: We release grove-transformers, a lightweight, inference-only interface for GROVE, implemented with Hugging Face Transformers.
- 21/08/2025: Code, checkpoints, and datasets released!
- 25/06/2025: Paper accepted to ICCV 2025
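As a convenience, here is a minimal sketch of loading the datasets with the Hugging Face Datasets library. The repository IDs below are placeholders, not the actual repo names; substitute the IDs from the Hugging Face links in the news item above.

```python
from datasets import load_dataset

# NOTE: the repository IDs below are placeholders; replace them with the
# actual IDs from the Hugging Face links in the news item above.
howtoground1m = load_dataset("ORG/HowToGround1M")
iground = load_dataset("ORG/iGround")

print(howtoground1m)  # shows the available splits and features
```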
BibTeX
@inproceedings{kazakos2025grove,
title = {Large-scale Pre-training for Grounded Video Caption Generation},
author = {Evangelos Kazakos and Cordelia Schmid and Josef Sivic},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}

If you only need inference for GROVE (quick experimentation, no training), use the lightweight grove-transformers package instead. Installation instructions are in its README and on the Hugging Face Hub.
If you want to train and/or evaluate GROVE end-to-end, follow the instructions below.
First, create a new conda environment:
conda create -n grove python=3.12
conda activate grove

Choose the CUDA version that matches your system (e.g., cu124, cu121, cu118).
Example for CUDA 12.4:
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install torchtext==0.18.0 torchdata==0.9.0

Tip: Replace cu124 in the URL with the correct CUDA version tag for your machine.
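Optionally, a quick check that PyTorch was installed against the intended CUDA build and can see a GPU:

```python
import torch

print(torch.__version__)          # expected: 2.5.1+<your CUDA tag>, e.g. 2.5.1+cu124
print(torch.version.cuda)         # expected: the CUDA version you chose, e.g. 12.4
print(torch.cuda.is_available())  # True if a compatible GPU and driver are found
```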
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout cfd5d3a985b0249de009b67d04f37263e11cdf3d
pip install -e . --no-build-isolation
cd ..

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout 57c4e25e06e2d4f8a9357c84bcd24089a284dc88
pip install -r requirements/optional.txt
pip install -e . -v
cd ..

git clone https://github.com/facebookresearch/sam2.git
cd sam2
git checkout 2b90b9f5ceec907a1c18123530e92e794ad901a4
pip install -e . --no-build-isolation
cd ..

pip install flash-attn==2.7.3 --no-build-isolation
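Optionally, verify that the source-installed packages import cleanly before continuing (a minimal check; the printed versions depend on the commits checked out above):

```python
# Quick import check for the dependencies installed from source above.
import mmdet
import mmcv
import sam2          # installed from the facebookresearch/sam2 repo
import flash_attn

print("mmdet:", mmdet.__version__)
print("mmcv:", mmcv.__version__)
print("flash-attn:", flash_attn.__version__)
```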
- Stanford CoreNLP 3.4.1 (for evaluation on HowToGround1M/iGround)
- Stanford CoreNLP 4.5.7 (for evaluation on ActivityNet-Entities)
pip install -r requirements.txt

- Download annotations
- Preprocess annotations
  - Run the following command to split the annotations into separate files per video:
    python scripts/preprocess_howtoground_annot.py /path/to/{HowToGround1M,iGround}.pkl target_dir
- Download iGround videos (for fine-tuning/evaluating on iGround)
  - Fill in this form to obtain an access token
  - Run
    export TOKEN=YOUR_TOKEN
  - Run the following script to download the iGround videos using the provided links
    bash scripts/download_datasets.sh iground /path/to/iground_videos_dir /path/to/iGround_annotations/*_raw.pkl
- Download HowToGround1M videos (for pre-training on HowToGround1M)
  - Fill in this form to obtain an access token
  - Run
    export TOKEN=YOUR_TOKEN
  - Run the following script to download the HowToGround1M videos using the provided links
    bash scripts/download_datasets.sh howtoground1m /path/to/howtoground1m_videos_dir /path/to/HowToGround1M_annotations/HowToGround1M_automatic_annotation_method.pkl
Note: The iGround annotations include both processed and raw versions (e.g., iGround_train_set_processed.pkl vs iGround_train_set_raw.pkl). The processed annotations were used to train GROVE: processing merges multiple instances of the same object type in a frame into a single annotation by taking the union of their bounding boxes (a sketch of this merging step is shown below). The raw annotations are unprocessed: the same object type may appear multiple times in a frame, each instance with its own bounding box.
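For illustration, the box-merging step described in the note can be sketched as follows, assuming boxes in [x1, y1, x2, y2] format (the coordinate convention and exact annotation schema of the released pickles are assumptions here):

```python
from typing import List

def union_box(boxes: List[List[float]]) -> List[float]:
    """Merge several boxes of the same object type into one enclosing box.

    Boxes are assumed to be [x1, y1, x2, y2]; the union is the smallest box
    that contains all of them.
    """
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return [x1, y1, x2, y2]

# Example: two instances of the same object type in a frame collapse into one box.
print(union_box([[10, 20, 50, 60], [40, 30, 90, 80]]))  # [10, 20, 90, 80]
```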
We also provide access to the original full HowTo100M dataset videos. To download:
- Fill in this form to obtain an access token
- Run
  export TOKEN=YOUR_TOKEN
- Run the following script to download the HowTo100M videos using the provided links
bash scripts/download_datasets.sh howto100m /path/to/howto100m_videos_dir
If you use HowTo100M in your research, please also cite:
@inproceedings{miech19howto100m,
title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
booktitle={ICCV},
year={2019},
}
- Download ActivityNet videos
  - From Hugging Face
- Download annotations
- Preprocess videos
  bash scripts/preprocess_anet_videos.sh input_dataset_dir preprocessed_dataset_dir

- Download VidSTG videos
  - From Hugging Face
- Download annotations
- Download GROVE pre-trained on HowToGround1M from link
- Download GROVE fine-tuned on iGround from link
- Download SAM checkpoint from link
- Run:
mkdir checkpoints
mv /path/to/checkpoints checkpoints/
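Optionally, you can sanity-check a downloaded checkpoint with PyTorch before training or inference. A minimal sketch, assuming the .bin file is a standard torch checkpoint (whether it is a plain state dict or a wrapper dict is an assumption):

```python
import torch

# Load on CPU so no GPU is needed for this quick check.
ckpt = torch.load("checkpoints/grove_ft_iground_ckpt.bin", map_location="cpu")

# Peek at the top-level structure without building the model itself.
if isinstance(ckpt, dict):
    print(f"{len(ckpt)} top-level entries")
    print(list(ckpt)[:5])
else:
    print(type(ckpt))
```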
- In train_scripts/train_{howtoground,vidstg,anet}.sh:
  - (Optional) Modify the sbatch configuration to match your cluster, though the provided settings are recommended
  - Modify the paths to the data and checkpoint
- Run:
bash train_scripts/train_{howtoground,vidstg,anet}.sh
Note: train_scripts/train_howtoground.sh can be used for both HowToGround1M and iGround datasets.
Below we show how to run inference and evaluation on the iGround validation and test sets. For the other datasets, use the corresponding scripts in infer_eval_scripts/.
- For iGround validation set:
bash infer_eval_scripts/infer_eval_iground.sh checkpoints/grove_ft_iground_ckpt.bin /path/to/save/token_embeddings.pt /path/to/save/preds.pkl /path/to/iGround_val_set_raw.pkl /path/to/iground_videos_dir 0.5 /path/to/stanford-corenlp-full-2014-08-27
- For iGround test set:
bash infer_eval_scripts/infer_eval_iground.sh checkpoints/grove_ft_iground_ckpt.bin /path/to/save/token_embeddings.pt /path/to/save/preds.pkl /path/to/iGround_test_set_raw.pkl /path/to/iground_videos_dir 0.5 /path/to/stanford-corenlp-full-2014-08-27
Note: Downloading Stanford CoreNLP from the links in the installation instructions gives you a directory stanford-corenlp-full-2014-08-27, which contains Stanford CoreNLP 3.4.1 (used above for evaluation on iGround), and a directory stanford-corenlp-4.5.7, which contains Stanford CoreNLP 4.5.7 (used for evaluation on ActivityNet-Entities).
