HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

Wang, Zhonghao; Wei, Wei; Zhao, Yang; Xiao, Zhisheng; Hasegawa-Johnson, Mark; Shi, Humphrey; Hou, Tingbo

arXiv:2312.00079 (cs)

[Submitted on 30 Nov 2023]

Abstract: This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2312.00079 [cs.CV]
	(or arXiv:2312.00079v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.00079

Submission history

From: Zhonghao Wang
[v1] Thu, 30 Nov 2023 02:33:29 UTC (13,906 KB)
