DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Wang, Zeyu; Lin, Jingyu; Qian, Yifei; Huang, Yi; Tian, Shicen; Chai, Bosong; Deng, Juncan; Yang, Qu; Du, Lan; Chen, Cunjian; Huang, Kejie

Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.15488 (cs)

[Submitted on 22 Jul 2024 (v1), last revised 20 Oct 2024 (this version, v5)]

Abstract: Diffusion models have made significant strides in language-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation. In fact, human perception of the world is enriched by diverse viewpoints, such as chromatic contrast, thermal illumination, and depth information. In this paper, we introduce DiffX, a novel diffusion model for general layout-guided cross-modal generation. DiffX presents a compact and effective cross-modal generative modeling pipeline that conducts the diffusion and denoising processes in a modality-shared latent space. Moreover, we introduce the Joint-Modality Embedder (JME), which incorporates a gated attention mechanism to enhance the interaction between layout and text conditions. To facilitate user-instructed training, we construct cross-modal image datasets with detailed text captions, produced by a Large Multimodal Model (LMM) and refined with a human in the loop. In extensive experiments, DiffX demonstrates robustness in cross-modal "RGB+X" image generation on the FLIR, MFNet, and COME15K datasets, guided by various layout conditions. It also shows strong potential for the adaptive generation of "RGB+X+Y(+Z)" images, or more diverse modalities, on the FLIR, MFNet, COME15K, and MCXFace datasets. To our knowledge, DiffX is the first model for layout-guided cross-modal image generation. Our code and constructed cross-modal image datasets are available at this https URL.
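
The abstract does not specify how the JME's gated attention is built, so the following is only a minimal sketch of one common form of gated attention, not the authors' implementation. It assumes layout tokens attend to text tokens and that the result is added back through a scalar `tanh` gate; the function names (`attend`, `gated_fusion`), the scalar gate, and the toy vectors are all hypothetical choices made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def gated_fusion(layout_tokens, text_tokens, gate):
    """Assumed form: layout + tanh(gate) * Attention(layout, text, text)."""
    g = math.tanh(gate)
    attended = attend(layout_tokens, text_tokens, text_tokens)
    return [[x + g * a for x, a in zip(tok, att)]
            for tok, att in zip(layout_tokens, attended)]

# Toy inputs (hypothetical): two layout tokens, three text tokens, width 2.
layout = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

closed = gated_fusion(layout, text, gate=0.0)  # gate closed: text is ignored
opened = gated_fusion(layout, text, gate=1.0)  # gate open: text is mixed in
```

With the gate at zero the layout tokens pass through unchanged, which is why this kind of gating is often used to add a new conditioning signal without disturbing a pretrained model at the start of training. Whether DiffX uses a scalar or a learned per-channel gate is not stated in the abstract.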

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.15488 [cs.CV]
	(or arXiv:2407.15488v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.15488

Submission history

From: Zeyu Wang
[v1] Mon, 22 Jul 2024 09:05:16 UTC (4,113 KB)
[v2] Sun, 28 Jul 2024 11:57:25 UTC (4,797 KB)
[v3] Tue, 6 Aug 2024 12:54:41 UTC (4,913 KB)
[v4] Sun, 25 Aug 2024 02:14:33 UTC (4,915 KB)
[v5] Sun, 20 Oct 2024 15:41:42 UTC (6,523 KB)
