Visual Instruction Tuning with Polite Flamingo

Chen, Delong; Liu, Jianfeng; Dai, Wenliang; Wang, Baoyuan


arXiv:2307.01003 (cs)

[Submitted on 3 Jul 2023 (v1), last revised 15 Dec 2023 (this version, v2)]



Abstract: Recent research has demonstrated that multi-task fine-tuning of multi-modal Large Language Models (LLMs) on an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet this process has a side effect, which we term the "multi-modal alignment tax". It degrades the model's ability to format responses appropriately -- for instance, its "politeness" -- because raw annotations are overly succinct and unformatted, and this in turn reduces human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations.
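
The reconstruction objective described in the abstract can be illustrated with a small sketch of how (distorted, original) training pairs might be built. The abstract does not say which distortions are used, so the rule below (keep the first sentence, strip end punctuation, lowercase) is an assumption chosen only to mimic a terse raw annotation, and the names `distort` and `make_pairs` are hypothetical rather than taken from the paper.

```python
import re


def distort(response: str) -> str:
    """Hypothetical distortion: turn a polished response into a terse one.

    The rules here are illustrative assumptions, not the paper's method.
    """
    # Keep only the first sentence to mimic an overly succinct annotation.
    first_sentence = re.split(r"(?<=[.!?])\s+", response.strip())[0]
    # Strip end punctuation and lowercase to mimic unformatted raw labels.
    return first_sentence.rstrip(".!?").lower()


def make_pairs(responses):
    """Return (distorted input, original target) pairs for a rewriter."""
    return [(distort(r), r) for r in responses]


original = "The image shows a brown dog. It is running along the beach."
pairs = make_pairs([original])
print(pairs[0][0])  # distorted input the rewriter would see
print(pairs[0][1])  # high-quality target it would learn to reconstruct
```

A rewriter trained on such pairs learns the reverse mapping, from terse text back to a well-formatted response, which is what would then be applied to raw dataset annotations.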

Comments:	In AAAI-24
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2307.01003 [cs.CV]
	(or arXiv:2307.01003v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2307.01003

Submission history

From: Delong Chen
[v1] Mon, 3 Jul 2023 13:37:00 UTC (1,242 KB)
[v2] Fri, 15 Dec 2023 10:09:58 UTC (1,242 KB)
