This is the official code for the paper "AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs".
We introduce AURELIA, a novel actor-critic-based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time. This improves their ability to process complex multi-modal inputs without additional training or fine-tuning.
Clone the repo & install dependencies
git clone https://github.com/schowdhury671/aurelia
conda create -n aurelia python=3.10 -y
conda activate aurelia
pip install openai==0.28.0
pip install google-generativeai
pip install -q -U pytube moviepy
apt-get install -y ffmpeg

To run the multi-agent pipeline, you must export valid keys for both the OpenAI and Google Gemini APIs:
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."

Run the data generation pipeline
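Before running the commands in this step, it can help to confirm that both keys are actually visible to Python. The snippet below is a minimal sketch and not part of the repository; the helper name `missing_api_keys` is hypothetical, and it only checks that the two environment variables named above are non-empty, not that the keys are valid.

```python
import os

# The two environment variables this README asks you to export.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or blank.

    `env` defaults to the real process environment; passing a dict
    makes the helper easy to test. (Hypothetical helper, not from the repo.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("Both API keys are set.")
```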
cd data_generation
python gen_data.py --save_path "reason_data.json" \
--video_path "sample.mp4" \
--audio_path "sample.mp3" \
--query "What is the most popular food of the country where the loudest instrument originates from?" \
--max_tries 5

gen_data.py iteratively calls the chosen LLMs, synthesizes the reasoning caption, and stops once the evaluator score is ≥ τ.
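The generate-then-evaluate loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of gen_data.py: the names `distill_reasoning`, `toy_generate`, and `toy_evaluate` are hypothetical, the toy functions stand in for the real LLM calls, and the threshold value used for τ is arbitrary. The sketch also assumes that the loop gives up after `max_tries` attempts and returns the best-scoring caption seen so far.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    caption: str   # reasoning caption produced by the generator
    score: float   # evaluator score for that caption
    tries: int     # attempt number at which it was produced


def distill_reasoning(generate, evaluate, query, tau, max_tries=5):
    """Run generate -> evaluate rounds until score >= tau or tries run out.

    generate(query, feedback) -> caption
    evaluate(query, caption)  -> (score, feedback)
    Both callables are placeholders for LLM-backed agents (assumption).
    """
    best = None
    feedback = None
    for attempt in range(1, max_tries + 1):
        caption = generate(query, feedback)
        score, feedback = evaluate(query, caption)
        if best is None or score > best.score:
            best = Attempt(caption, score, attempt)
        if score >= tau:
            break  # the evaluator is satisfied, stop early
    return best


def toy_generate(query, feedback):
    # Toy stand-in: add one more reasoning step each time feedback arrives.
    steps = 1 if feedback is None else int(feedback) + 1
    return " -> ".join(f"step {i}" for i in range(1, steps + 1))


def toy_evaluate(query, caption):
    # Toy stand-in: score grows with the number of steps, capped at 1.0.
    n = caption.count("step")
    return min(1.0, n / 4), str(n)


result = distill_reasoning(toy_generate, toy_evaluate, "example query", tau=0.75)
print(result)
```

Keeping the generator and evaluator as plain callables makes it straightforward to swap the toy functions for wrappers around the OpenAI and Gemini clients.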
We benchmark various AV-LLMs following the settings mentioned in Section 5 of the paper.
Given below are the links to the checkpoints of the public AV-LLMs.
AURELIA is evaluated on AVReasonBench – a curated collection of 4,500 audio-visual QA samples paired with gold reasoning chains spanning six diverse tasks:
| 🎯 Task | 📚 Source | 🔢 #Samples |
|---|---|---|
| AV-QA (Music-AVQA) | 🎵 Music-AVQA dataset | 1000 |
| AV-QA (AVSD) | 🎥 AVSD dataset | 1000 |
| AV-Captioning | 📝 VALOR dataset | 1000 |
| AV-Compositional | 🌐 Web-scraped pairs | 1000 |
| AV-GeoIQ | 🗺️ Manually authored | 200 |
| AV-Meme | 🎭 AV-Odyssey Bench + augmentation | 100 |
| Dance–Music Match | 💃 DM-Match | 200 |
| Total | | 4500 |
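As a quick sanity check, the per-task counts in the table can be summed to confirm the stated total. The dictionary below simply transcribes the table; the variable names are illustrative only.

```python
# Sample counts transcribed from the AVReasonBench table above.
samples_per_task = {
    "AV-QA (Music-AVQA)": 1000,
    "AV-QA (AVSD)": 1000,
    "AV-Captioning": 1000,
    "AV-Compositional": 1000,
    "AV-GeoIQ": 200,
    "AV-Meme": 100,
    "Dance-Music Match": 200,
}

total = sum(samples_per_task.values())
print(total)  # 4500
```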
📦 Download the dataset: ➡️ Click here
If you find AURELIA useful in your research, please cite:
@article{chowdhury2025aurelia,
title={AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs},
author={Chowdhury, Sanjoy and Ghani, Hanan and Anand, Nishit and Nag, Sayan and Gao, Ruohan and Elhoseiny, Mohamed and Khan, Salman and Manocha, Dinesh},
journal={arXiv preprint arXiv:2503.23219},
year={2025}
}