
Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds [CVPR2025]

Eitan Shaar*, Ariel Shaulov*, Gal Chechik, Lior Wolf

In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions.
To address these challenges, we propose Audio-Visual Adaptive Video Analysis (AV²A), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. AV²A also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that AV²A achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of AV²A on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.
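
To give an intuition for the score-level fusion and the fixed detection threshold, here is a minimal sketch. It is an illustration only, not the exact implementation in main.py: it assumes per-segment cosine-similarity matrices between each modality and the candidate labels, and combines them with a visual weight alpha (mirroring the --alpha flag) before thresholding (cf. --filter_threshold).

```python
import numpy as np

def score_level_fusion(visual_sims, audio_sims, alpha=0.5):
    """Combine per-segment visual and audio label scores at the score level.

    visual_sims, audio_sims: arrays of shape (T, C) holding cosine similarities
    between T temporal segments and C candidate label embeddings.
    alpha weights the visual stream (named after the --alpha CLI flag); the
    exact combination rule used by AV2A may differ -- this is a sketch.
    """
    visual_sims = np.asarray(visual_sims, dtype=np.float32)
    audio_sims = np.asarray(audio_sims, dtype=np.float32)
    return alpha * visual_sims + (1.0 - alpha) * audio_sims

def threshold_events(fused_scores, threshold=0.55):
    """Mark (segment, label) pairs whose fused score clears a fixed threshold
    (cf. --filter_threshold). Returns a boolean (T, C) event mask."""
    return np.asarray(fused_scores) >= threshold
```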

Requirements

conda env create -f environment.yml

Run

# LanguageBind
python main.py --video_dir_path "" --audio_dir_path "" --gpu_id 0 --backbone language_bind --candidates_file_path "" --alpha 0.5 --filter_threshold 0.55 --threshold_stage1 0.75 --threshold_stage2 0.75 --gamma 2.5 --dataset LLP/AVE --method bbse-cosine --fusion early

# CLIP & CLAP
python main.py --video_dir_path "" --audio_dir_path "" --gpu_id 0 --backbone clip_clap --candidates_file_path "" --alpha 0.45 --filter_threshold 0.5 --threshold_stage1 0.75 --threshold_stage2 0.75 --gamma 1 --dataset LLP/AVE --method bbse-cosine --fusion early
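
For intuition on the within-video label shift component, the sketch below shows one way such a correction could look: soft predictions from earlier segments accumulate into a running label prior that re-weights the scores of later segments. This is an assumption-laden illustration, not the bbse-cosine procedure selected by --method; the momentum parameter here is invented for the example.

```python
import numpy as np

def within_video_label_shift(score_stream, momentum=0.9, eps=1e-6):
    """Illustrative within-video label-shift correction (sketch only).

    score_stream: iterable of (C,) fused score vectors, one per segment.
    momentum: smoothing factor for the running label prior (hypothetical,
    not a parameter of the released code).
    """
    prior, adjusted = None, []
    for scores in score_stream:
        scores = np.asarray(scores, dtype=np.float32)
        if prior is None:
            prior = np.full_like(scores, 1.0 / scores.size)  # uniform start
        # Bias the current segment's scores toward labels seen so far.
        weighted = scores * (prior + eps)
        weighted = weighted / (weighted.sum() + eps)
        adjusted.append(weighted)
        # Update the running prior with this segment's soft prediction.
        prior = momentum * prior + (1.0 - momentum) * weighted
    return np.stack(adjusted)
```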

Citation

If you find our code useful for your research, please consider citing our paper.

@misc{shaar2025adaptingunknowntrainingfreeaudiovisual,
      title={Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds}, 
      author={Eitan Shaar and Ariel Shaulov and Gal Chechik and Lior Wolf},
      year={2025},
      eprint={2503.13693},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.13693}, 
}
