MixFormerV2: Efficient Fully Transformer Tracking

Cui, Yutao; Song, Tianhui; Wu, Gangshan; Wang, Limin

arXiv:2305.15896 (cs)

[Submitted on 25 May 2023 (v1), last revised 7 Feb 2024 (this version, v2)]

Abstract: Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. To overcome this issue, we propose a fully transformer tracking framework, coined MixFormerV2, without any dense convolutional operation or complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and search areas. We then apply a unified transformer backbone to this mixed token sequence. The prediction tokens capture the complex correlation between target template and search area via mixed attention. Based on them, we can easily predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm, consisting of dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter prunes some layers of the backbone. We instantiate two variants of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6% on LaSOT and an AUC of 57.4% on TNL2K with a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT with a real-time CPU speed.
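
The token design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention layer, the random weights, the token counts, the rule of one scalar box value per prediction token, and the averaging of prediction tokens before the score head are all assumptions made for illustration, and every function name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(tokens, w_q, w_k, w_v):
    # Single-head self-attention over the whole mixed sequence, with a
    # residual connection. Every token attends to every other token, so
    # prediction tokens can read from both template and search tokens.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + weights @ v

def mlp_head(x, w1, w2):
    # Two-layer MLP with a ReLU in between.
    return np.maximum(x @ w1, 0.0) @ w2

def track_step(template, search, pred_tokens, layers, box_head, score_head):
    # Concatenate prediction, template and search tokens into one sequence.
    mixed = np.concatenate([pred_tokens, template, search], axis=0)
    for w_q, w_k, w_v in layers:
        mixed = attention_layer(mixed, w_q, w_k, w_v)
    # Read back only the prediction tokens (they were placed first).
    pred = mixed[: pred_tokens.shape[0]]
    # Assumption: each prediction token yields one scalar box value.
    box = mlp_head(pred, *box_head)[:, 0]
    # Assumption: the score head sees the mean of the prediction tokens.
    logit = mlp_head(pred.mean(axis=0), *score_head)[0]
    score = 1.0 / (1.0 + np.exp(-logit))
    return box, score

rng = np.random.default_rng(0)
dim = 16
template = rng.normal(size=(4, dim))     # hypothetical 2x2 template patches
search = rng.normal(size=(9, dim))       # hypothetical 3x3 search patches
pred_tokens = rng.normal(size=(4, dim))  # four prediction tokens

def rand(*shape):
    return rng.normal(scale=0.1, size=shape)

layers = [(rand(dim, dim), rand(dim, dim), rand(dim, dim)) for _ in range(2)]
box_head = (rand(dim, dim), rand(dim, 1))
score_head = (rand(dim, dim), rand(dim, 1))

box, score = track_step(template, search, pred_tokens, layers, box_head, score_head)
print(box.shape, float(score))
```

Because the heads read only the four prediction tokens, their cost does not grow with the number of search tokens, which is consistent with the abstract's point about avoiding dense convolutional heads.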

Comments:	NIPS2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.15896 [cs.CV]
	(or arXiv:2305.15896v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.15896

Submission history

From: Yutao Cui
[v1] Thu, 25 May 2023 09:50:54 UTC (489 KB)
[v2] Wed, 7 Feb 2024 12:20:21 UTC (1,663 KB)
