TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification

Liu, Qinying; Zheng, Kecheng; Wei, Wu; Tong, Zhan; Liu, Yu; Chen, Wei; Wang, Zilei; Shen, Yujun


arXiv:2312.14149v1 (cs)

[Submitted on 21 Dec 2023 (this version), latest version 26 Mar 2024 (v4)]



Abstract: The crux of learning vision-language models is to extract semantically aligned information from visual and linguistic data. Existing attempts usually suffer from coarse alignment, e.g., the vision encoder struggles to localize an attribute-specified object. In this work, we propose an embarrassingly simple approach to better align image and text features, requiring no data format other than image-text pairs. Concretely, given an image and its paired text, we parse objects (e.g., cat) and attributes (e.g., black) from the description, which are highly likely to exist in the image. Notably, the parsing pipeline is fully automatic and therefore scales well. With these parsed semantics as supervision signals, we complement the commonly used image-text contrastive loss with a multi-tag classification loss. Extensive experiments on a broad suite of semantic segmentation datasets show an average improvement of 3.65% for our framework over existing alternatives. Furthermore, visualization results indicate that attribute supervision enables vision-language models to accurately localize attribute-specified objects. Project page can be found at this https URL
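
The abstract describes the training signal only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a toy keyword-matching parser, a hypothetical five-word tag vocabulary (TAG_VOCAB), a binary cross-entropy form for the multi-tag classification loss, and a hypothetical weight tag_weight for combining it with a symmetric image-text contrastive loss. None of these specifics are stated in the abstract.

```python
import math

# Hypothetical tag vocabulary (objects and attributes); not taken from the paper.
TAG_VOCAB = ["cat", "dog", "sofa", "black", "white"]


def parse_tags(caption, vocab=TAG_VOCAB):
    """Toy stand-in for the parsing pipeline: multi-hot vector of vocab words in the caption."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return [1.0 if tag in words else 0.0 for tag in vocab]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def multi_tag_loss(image_feat, tag_feats, targets):
    """Binary cross-entropy over one image-tag logit per vocabulary entry (assumed form)."""
    total = 0.0
    for tag_feat, y in zip(tag_feats, targets):
        logit = dot(image_feat, tag_feat)
        # Numerically stable BCE-with-logits: max(x, 0) - x*y + log(1 + exp(-|x|))
        total += max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))
    return total / len(targets)


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch where image k is paired with text k."""
    n = len(image_feats)
    logits = [[dot(i, t) / temperature for t in text_feats] for i in image_feats]

    def cross_entropy(rows):
        loss = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_z - row[k]  # negative log-probability of the matching pair
        return loss / n

    columns = [list(c) for c in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def total_loss(image_feats, text_feats, tag_feats, captions, tag_weight=1.0):
    """Contrastive loss plus the batch-averaged multi-tag loss (tag_weight is hypothetical)."""
    tag_term = sum(
        multi_tag_loss(img, tag_feats, parse_tags(cap))
        for img, cap in zip(image_feats, captions)
    ) / len(captions)
    return contrastive_loss(image_feats, text_feats) + tag_weight * tag_term
```

In this sketch, parse_tags("A black cat sits on the sofa.") marks "cat", "sofa", and "black" as positive tags, so the image feature is pushed toward those three tag embeddings and away from "dog" and "white", in addition to being matched with its caption by the contrastive term.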

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2312.14149 [cs.CV]
	(or arXiv:2312.14149v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.14149

Submission history

From: Qinying Liu
[v1] Thu, 21 Dec 2023 18:59:06 UTC (7,523 KB)
[v2] Tue, 26 Dec 2023 12:23:45 UTC (7,523 KB)
[v3] Mon, 25 Mar 2024 15:06:33 UTC (7,122 KB)
[v4] Tue, 26 Mar 2024 12:47:12 UTC (6,524 KB)
