Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Yan, Ziang; Li, Zhilin; He, Yinan; Wang, Chenting; Li, Kunchang; Li, Xinhao; Zeng, Xiangyu; Wang, Zilei; Wang, Yali; Qiao, Yu; Wang, Limin; Wang, Yi

arXiv:2412.19326 (cs)

[Submitted on 26 Dec 2024 (v1), last revised 30 Jun 2025 (this version, v2)]

Abstract: Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, although they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either rely on tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at this https URL
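
As a rough illustration of the idea of task tokens connecting task-specific heads to the MLLM, the following minimal sketch routes the hidden state at each task-token position to a matching head and adds the head losses to the language-modeling loss. It is not the authors' implementation: the token names (`<region>`, `<temporal>`), the linear heads, the L1 loss, the equal loss weighting, and every function name here are assumptions made for illustration, and plain Python lists stand in for tensors, so no gradients are actually computed.

```python
import random

# Hypothetical task tokens and the (hypothetical) head each one is routed to.
TASK_TOKENS = {"<region>": "region", "<temporal>": "temporal"}


def make_linear_head(in_dim, out_dim, seed):
    """Build a toy linear head; a stand-in for a real task-specific head."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def head(hidden):
        # One dot product per output dimension.
        return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

    return head


def route_task_tokens(tokens, hidden_states, heads):
    """Send the hidden state at each task-token position to its head."""
    outputs = {}
    for token, hidden in zip(tokens, hidden_states):
        task = TASK_TOKENS.get(token)
        if task is not None:
            outputs[task] = heads[task](hidden)
    return outputs


def l1_loss(pred, target):
    """Mean absolute error (an assumed choice of task loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(lm_loss, task_outputs, task_targets):
    """Language-modeling loss plus equally weighted task losses (assumed)."""
    loss = lm_loss
    for task, pred in task_outputs.items():
        loss += l1_loss(pred, task_targets[task])
    return loss


# Usage: a three-token sequence whose middle token is a task token.
heads = {
    "region": make_linear_head(4, 4, seed=0),    # e.g. box coordinates
    "temporal": make_linear_head(4, 2, seed=1),  # e.g. start/end time
}
tokens = ["describe", "<region>", "box"]
hidden_states = [[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.1, 0.0, 0.2],
                 [0.3, 0.3, 0.3, 0.3]]
task_outputs = route_task_tokens(tokens, hidden_states, heads)
loss = total_loss(1.0, task_outputs, {"region": [0.0, 0.0, 0.0, 0.0]})
```

In a real training setup the heads and task-token embeddings would be differentiable modules, so the task losses could propagate back into the MLLM; the sketch only shows the routing and the combined objective.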

Comments:	CVPR2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.19326 [cs.CV]
	(or arXiv:2412.19326v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.19326

Submission history

From: Yi Wang
[v1] Thu, 26 Dec 2024 18:56:05 UTC (6,124 KB)
[v2] Mon, 30 Jun 2025 13:15:13 UTC (6,124 KB)
