X-VILA: Cross-Modality Alignment for Large Language Model

Ye, Hanrong; Huang, De-An; Lu, Yao; Yu, Zhiding; Ping, Wei; Tao, Andrew; Kautz, Jan; Han, Song; Xu, Dan; Molchanov, Pavlo; Yin, Hongxu

Computer Science > Computer Vision and Pattern Recognition

arXiv:2405.19335 (cs)

[Submitted on 29 May 2024]

Abstract: We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

Comments:	Technical Report
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2405.19335 [cs.CV]
	(or arXiv:2405.19335v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.19335

Submission history

From: Hanrong Ye
[v1] Wed, 29 May 2024 17:59:58 UTC (5,032 KB)
