Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning

Zhong, Hanwen; Chen, Jiaxin; Zhang, Yutong; Huang, Di; Wang, Yunhong

arXiv:2501.06884 (cs)

[Submitted on 12 Jan 2025]

Abstract: Multi-Task Learning (MTL) for Vision Transformer aims to enhance model capability by tackling multiple tasks simultaneously. Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to perform multi-task learning efficiently. However, their rigid combination hampers both the optimization of MoE and the effectiveness of LoRA reparameterization, leading to sub-optimal performance and low inference speed. In this work, we propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL), which transforms a pre-trained Vision Transformer into an efficient multi-task learner during training and reparameterizes the learned structure for efficient inference. Specifically, we first develop the MoEfied LoRA structure, which decomposes the pre-trained Transformer into a low-rank MoE structure and employs LoRA to fine-tune the parameters. Subsequently, we take into account the intrinsic asynchronous nature of multi-task learning and devise a learning Quality Retaining (QR) optimization mechanism, which leverages historical high-quality class logits to prevent a well-trained task from performance degradation. Finally, we design a router fading strategy to integrate the learned parameters into the original Transformer, achieving efficient inference. Extensive experiments on public benchmarks demonstrate the superiority of our method compared to state-of-the-art multi-task learning approaches.
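The reparameterization the abstract refers to builds on the standard LoRA identity: a frozen weight W plus a low-rank branch B·A can be folded into a single matrix W + s·B·A, so inference costs the same as the original layer. The sketch below illustrates only that generic identity in plain Python. It is not the authors' EMTAL implementation and does not cover the MoE decomposition, the QR mechanism, or router fading. All names (`lora_forward`, `merge_lora`, `scale`) and the matrix sizes are assumptions made for illustration.

```python
import random

def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def add(P, Q, scale=1.0):
    """Return P + scale * Q, elementwise."""
    return [[p + scale * q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

def lora_forward(x, W, A, B, scale=1.0):
    """Training-time forward pass: frozen weight plus a separate low-rank branch."""
    base = matmul(x, transpose(W))                             # x W^T
    low_rank = matmul(matmul(x, transpose(A)), transpose(B))   # (x A^T) B^T
    return add(base, low_rank, scale)

def merge_lora(W, A, B, scale=1.0):
    """Fold the low-rank update into the weight: W + scale * (B A)."""
    return add(W, matmul(B, A), scale)

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (illustrative data only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 8 input features, 6 output features, rank-2 update.
rng = random.Random(0)
d_in, d_out, rank = 8, 6, 2
W = rand_matrix(d_out, d_in, rng)   # frozen pre-trained weight
A = rand_matrix(rank, d_in, rng)    # LoRA down-projection
B = rand_matrix(d_out, rank, rng)   # LoRA up-projection
x = rand_matrix(4, d_in, rng)       # batch of 4 inputs

y_train = lora_forward(x, W, A, B, scale=0.5)
W_merged = merge_lora(W, A, B, scale=0.5)
y_merged = matmul(x, transpose(W_merged))

# The two paths agree up to floating-point rounding.
max_abs_diff = max(
    abs(a - b) for ra, rb in zip(y_train, y_merged) for a, b in zip(ra, rb)
)
```

After merging, the layer has the same shape as the original weight, which is why a reparameterized model adds no inference overhead relative to the pre-trained one.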

Comments:	Accepted by the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.06884 [cs.CV]
	(or arXiv:2501.06884v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.06884

Submission history

From: Hanwen Zhong
[v1] Sun, 12 Jan 2025 17:41:23 UTC (19,233 KB)
