Customizing Text-to-Image Models With A Single Image Pair
1 Introduction
Artistic works are often inspired by a reference image, a recurring scene, or even
a previous piece of art [54]. Such creations involve re-interpreting an original
composition in the artist’s unique style. A notable example is Van Gogh’s Repe-
titions [63], in which the artist created multiple versions of the same scenes with
his distinctive expressiveness, including adaptations of other artists’ work. Such
sets of variations allow close comparison of stylized art to a reference image,
providing unique insights into an artist’s detailed techniques and choices.
In our work, we explore how such content-style image pairs can be used to
customize a generative model to capture the demonstrated stylistic difference.
Our goal is to customize a pre-trained generative model to synthesize stylized
images, distilling the essence of the style from as few as a single pair without
fixating on specific content. We wish to create a model capable of re-interpreting
a variety of different content in the style demonstrated by the paired variation.
Prior works on model customization/personalization [18,43,71] take one or a
few images of a single concept to customize large-scale text-to-image models [67, 69].
Fig. 1: Given a single image pair, we present Pair Customization, a method for cus-
tomizing a pre-trained text-to-image model and learning a new style from the image
pair’s stylistic difference. Our method can apply the learned stylistic difference to
new input images while preserving the input structure. Compared to DreamBooth
LoRA [33, 73], a standard customization method that solely uses style images, our
method effectively disentangles style and content, resulting in better structure, color
preservation, and style application. Style image credit: Jack Parkhouse.
While these methods aim to learn styles without using pairs, the generated samples
from these customized models often resemble the training images’ content, such
as specific objects, persons, and scene layouts. In Figure 1, we observe that
standard single-image customization (3rd row) alters the subject, color tone, and
pose of the original image (1st row). These issues arise because the artistic intent
is difficult to discern from a single image: unlike image pairs that can demonstrate
a style through contrasts, a singleton example will always intertwine choices of
both style and content. Due to this ambiguity, the model fails to capture the
artistic style accurately and, in some cases, overfits and generates the subject-
specific details rather than the style, as shown in Figure 6.
On the other hand, our Pair Customization method exploits the contrast
between image pairs to generate pairwise consistent images while better disen-
tangling style and content. In Figure 1 (2nd row), our method accurately follows
the given style, turning the background into a single color matching the origi-
nal background and preserving the identity and pose for each dog. Our method
achieves this by disentangling the intended style from the image pair.
Our new customization task is challenging since text-to-image models were
not initially designed to generate pairwise content. Even when given specific text
prompts like “a portrait” and “a portrait with Picasso style”, a text-to-image
diffusion model often struggles to generate images with consistent structure from
the same noise seed. Therefore, it remains unclear how a customized model can
generate stylized images while maintaining the original structure.
2 Related Works
3 Method
Our method seeks to learn a new style from a single image pair. This task is
challenging, as models tend to overfit when trained on a single image, especially
when generating images in the same category as the training image (e.g., a model
trained and tested on dog photos). To reduce this overfitting, we introduce a new
algorithm aimed at disentangling the structure of the subject from the style of the
artwork. Specifically, we leverage the image pair to learn separate model weights
for style and content. At inference time, we modify the standard classifier-free
guidance formulation to help preserve the original image structure when applying
the learned style. In this section, we give a brief overview of diffusion models,
outline our design choices, and explain the final method in detail.
Diffusion models. Diffusion models [30, 81, 84] map Gaussian noise to the
image distribution through iterative denoising. Denoising is learned by reversing
the forward diffusion process $x_0, \ldots, x_T$, where an image $x_0$ is slowly diffused
to random noise $x_T$ over $T$ timesteps, defined by $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ for
timestep $t \in [0, T]$. Noise $\epsilon \sim \mathcal{N}(0, I)$ is randomly sampled, and $\bar{\alpha}_t$ controls
the noise strength. The training objective of diffusion models is to denoise any
intermediate noisy image $x_t$ via noise prediction:
\begin{aligned} \mathbb{E}_{\epsilon, x, c, t}\left[w_t \| \epsilon - \epsilon_{\theta}(x_t, c, t)\|^2\right], \end{aligned} (1)
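To make the objective concrete, below is a minimal PyTorch-style sketch of Equation 1 combined with the forward diffusion step; the `denoiser` callable, the `alpha_bar` schedule tensor, and the choice of uniform weighting w_t = 1 are illustrative assumptions, not the paper's implementation.

```python
import torch

def diffusion_loss(denoiser, x0, cond, alpha_bar, T=1000):
    """Noise-prediction loss (Eq. 1) with w_t = 1 for simplicity (sketch)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random timesteps
    eps = torch.randn_like(x0)                         # epsilon ~ N(0, I)
    a = alpha_bar[t].view(b, 1, 1, 1)                  # cumulative noise schedule
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps         # forward diffusion of x0
    eps_pred = denoiser(x_t, cond, t)                  # predict the added noise
    return ((eps - eps_pred) ** 2).mean()
```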
Fig. 2: Method overview. (Left) We disentangle style and content from an image
pair by jointly training two low-rank adapters, Style LoRA and Content LoRA, representing
style and content, respectively. Our training objective consists of two losses:
the first loss fine-tunes Content LoRA to reconstruct the content image conditioned on a
content prompt; the second loss encourages reconstructing the style image using both
Style LoRA and Content LoRA conditioned on a style prompt, but we only optimize
Style LoRA for this loss. (Right) At inference time, we only apply Style LoRA to customize
the model. Given the same noise seed, the customized model generates a stylized
counterpart of the original pre-trained model output. V* is a fixed random rare token
used as a prompt modifier for the content image. Style image credit: Jack Parkhouse.
Low-Rank Adaptation (LoRA) [33] fine-tunes a model by adding a low-rank update ΔW_LoRA = BA to a frozen pre-trained weight matrix W_0:
\begin{aligned} W_{\text{LoRA}} = W_0 + \Delta W_{\text{LoRA}} = W_0 + BA. \end{aligned} \label{eqn:lora} (2)
At inference time, the strength of the learned update can be modulated with a scale α:
\label{eq:oldinference} \begin{aligned} W_{\text{LoRA}} = W_0 + \alpha \Delta W_{\text{LoRA}}. \end{aligned} (3)
LoRA has been applied for customizing text-to-image diffusion models to learn
new concepts with as few as three to five images [74].
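As a concrete reference for Equations 2 and 3, here is one possible PyTorch sketch of a LoRA-wrapped linear layer; the rank, the initialization of A, and the `LoRALinear` name are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W0 plus a low-rank update alpha * B @ A (Eqs. 2-3)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                              # keep W0 frozen
        out_dim, in_dim = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_dim) / rank)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # zero-init, so dW = 0 at start
        self.alpha = alpha                                       # LoRA scale from Eq. 3

    def forward(self, x):
        return self.base(x) + self.alpha * (x @ self.A.T) @ self.B.T
```

Wrapping, for example, the attention projection layers of the denoiser with such modules and training only A and B is the usual way this kind of customization is set up.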
To learn from the image pair, we first optimize the content LoRA weights to reconstruct the content image:
\label{eq:content} \begin{aligned} \mathcal{L}_{\text{content}} = \mathbb{E}_{\epsilon, x_\text{content}, t}\left[w_t\|\epsilon - \epsilon_{\theta_\text{content}}(x_{t, \text{content}}, c_\text{content}, t)\|^2\right], \end{aligned} (4)
where ϵθcontent is the denoiser with content LoRA applied, xt,content is a noisy
content image at timestep t, and ccontent is text representing the content image,
including some rare token V*. Next, we optimize the combined style and content
weights to reconstruct the style image. In particular, we only train the style
LoRA weights during this step, while stopping the gradient flow to the content
LoRA weights via the stop-gradient operator sg[·]:
\label {eq:combination} \begin {aligned} \theta _{\text {combined}} = \theta _0 + \text {sg}[\Delta \theta _{\text {content}}] + \Delta \theta _{\text {style}}. \end {aligned} (5)
The style image is then reconstructed with the combined weights:
\label{eq:combined} \begin{aligned} \mathcal{L}_{\text{combined}} = \mathbb{E}_{\epsilon, x_\text{style}, t}\left[w_t\|\epsilon - \epsilon_{\theta_\text{combined}}(x_{t, \text{style}}, c_\text{style}, t)\|^2\right], \end{aligned} (6)
where ϵθcombined is the denoiser with both LoRAs applied as in Equation 5, cstyle
is “{ccontent} in <desc> style”, and <desc> is a worded description of the style
(e.g., “digital art”). Finally, we jointly optimize the LoRAs with the two losses:
\label {eq:combined2} \begin {aligned} \min _{\Delta \theta _{\text {content}}, \Delta \theta _{\text {style}}} \mathcal {L}_{\text {content}} + \mathcal {L}_{\text {combined}} \end {aligned} (7)
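A minimal sketch of one joint training step for Equations 4-7 is shown below; `set_adapters`, `content_lora_params`, and the reuse of `diffusion_loss` from the earlier sketch are assumed helpers, and the stop-gradient sg[·] is emulated by discarding the gradients that the style loss places on the content LoRA.

```python
def training_step(denoiser, pair, opt_content, opt_style, alpha_bar):
    """One joint optimization step of Eq. 7 (sketch with assumed helpers)."""
    x_content, x_style, c_content, c_style = pair

    # L_combined (Eq. 6): both LoRAs applied in the forward pass (Eq. 5).
    set_adapters(denoiser, content=True, style=True)
    loss_combined = diffusion_loss(denoiser, x_style, c_style, alpha_bar)
    loss_combined.backward()
    for p in content_lora_params(denoiser):
        p.grad = None   # sg[.]: the style loss must not update the content LoRA

    # L_content (Eq. 4): only the content LoRA is active.
    set_adapters(denoiser, content=True, style=False)
    loss_content = diffusion_loss(denoiser, x_content, c_content, alpha_bar)
    loss_content.backward()

    opt_style.step()
    opt_content.step()
    opt_style.zero_grad()
    opt_content.zero_grad()
```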
Both adapters are parameterized as low-rank updates to the pre-trained weights:
\begin{aligned} W_\text{content} = W_0 + B_\text{content}A_\text{content};\quad W_\text{style} = W_0 + B_\text{style}A_\text{style}. \end{aligned} (8)
We initialize Bcontent , Bstyle with the zero matrix and choose the rows of
Acontent , Astyle from an orthonormal basis. We then fix Acontent , Astyle and only
update Bcontent , Bstyle in training. This forces the style and content LoRA up-
dates to respond to orthogonal inputs, and empirically reduces visual artifacts, as
shown in Figure 3. This technique is inspired by Po et al. [64]. While their work
focuses on merging multiple customized objects after each is trained separately,
we apply the method for style-content separation during joint training.
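The sketch below illustrates one way to realize this initialization, assuming 2r ≤ d so that the two adapters can draw disjoint rows from a random orthonormal basis; the QR-based construction and the per-layer function shape are assumptions, not necessarily the authors' implementation.

```python
import torch

def init_orthogonal_loras(in_dim: int, out_dim: int, rank: int):
    """Fix A_content and A_style to disjoint rows of a random orthonormal basis;
    start both B matrices at zero and train only the B matrices (sketch)."""
    q, _ = torch.linalg.qr(torch.randn(in_dim, in_dim))   # orthonormal basis of R^in_dim
    A_content = q[:, :rank].T.contiguous()                # r orthonormal rows
    A_style = q[:, rank:2 * rank].T.contiguous()          # orthogonal to A_content's rows
    B_content = torch.zeros(out_dim, rank, requires_grad=True)
    B_style = torch.zeros(out_dim, rank, requires_grad=True)
    return (A_content, B_content), (A_style, B_style)
```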
At inference time, classifier-free guidance [31] amplifies the effect of text conditioning by extrapolating away from the unconditional prediction:
\label{eq:cfg} \begin{aligned} \hat{\epsilon}_\theta(x_t, \mathbf{c}) = \epsilon_\theta(x_t, \varnothing) + \lambda_{\text{cfg}} (\epsilon_\theta(x_t, \mathbf{c}) - \epsilon_\theta(x_t, \varnothing)), \end{aligned} (9)
where ϵ̂θ(xt, c) is the new noise prediction, ∅ denotes no conditioning, and
λcfg controls the amplification of text guidance. For notational simplicity, we omit
the timestep t in this equation and subsequent ones.
To improve pairwise consistency between original and stylized content, we
propose an inference algorithm that preserves the original denoising path while
adding controllable style guidance:
\label{eq:newinference} \begin{aligned} \hat{\epsilon}_{\theta_0,\theta_\text{style}}(x_t, c, c_\text{style}) = \; & \epsilon_{\theta_0}(x_t, \varnothing) \\ +\ & \lambda_{\text{cfg}} (\epsilon_{\theta_0}(x_t, c) - \epsilon_{\theta_0}(x_t, \varnothing)) \\ +\ & \lambda_{\text{style}}(\epsilon_{\theta_{\text{style}}}(x_t, c_{\text{style}}) - \epsilon_{\theta_0}(x_t, c)), \end{aligned}
(10)
where style guidance is the difference in noise prediction between style LoRA
and the pre-trained model. Style guidance strength is controlled by λstyle , and
setting λstyle = 0 is equivalent to generating original content. In Figure 3, we
compare our style guidance against scaling LoRA weights (Equation 3), and we
find that our method better preserves the layout. More details and a derivation
of our style guidance are in Appendix B.
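A minimal sketch of the noise prediction in Equation 10 at a single denoising step is given below; the `denoiser` signature and the `set_adapters` helper for switching the Style LoRA on and off are assumptions.

```python
import torch

@torch.no_grad()
def style_guided_eps(denoiser, x_t, c, c_style, t, cfg_scale=5.0, style_scale=3.0):
    """Noise prediction of Eq. 10: classifier-free guidance from the pre-trained
    model plus a style-guidance term from the Style-LoRA model (sketch)."""
    set_adapters(denoiser, style=False)
    eps_uncond = denoiser(x_t, None, t)            # eps_theta0(x_t, null)
    eps_text = denoiser(x_t, c, t)                 # eps_theta0(x_t, c)
    set_adapters(denoiser, style=True)
    eps_style = denoiser(x_t, c_style, t)          # eps_theta_style(x_t, c_style)
    return (eps_uncond
            + cfg_scale * (eps_text - eps_uncond)      # text guidance
            + style_scale * (eps_style - eps_text))    # style guidance
```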
Previous works [8,49] have also considered applying multiple guidance terms
with diffusion models. A major difference in our approach is that we obtain
additional guidance from a customized model and apply it to the original model.
Styledrop [82] considers a similar formulation with two guidance terms but for
transformer-based generative models.
Blending multiple learned styles. With a collection of models customized
by our method, we can blend the learned styles as follows. Specifically, given
some set of styles S and strengths λstyle0 , . . . , λstylen , we can blend the style
guidance from each model, and our new inference path is represented by
\begin{aligned} \hat{\epsilon}_{\theta_0,\theta_\text{style}}(x_t, c, c_\text{style}) = \; & \epsilon_{\theta_0}(x_t, \varnothing) \\ +\ & \lambda_{\text{cfg}} (\epsilon_{\theta_0}(x_t, c) - \epsilon_{\theta_0}(x_t, \varnothing)) \\ +\ & \sum_{\text{style}_i \in \mathcal{S}}\lambda_{\text{style}_i}(\epsilon_{\theta_{\text{style}_i}}(x_t, c_{\text{style}_i}) - \epsilon_{\theta_0}(x_t, c)), \end{aligned} \label{eqn:blend}
(11)
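Extending the previous sketch, Equation 11 simply accumulates one style-guidance term per customized model; `style_prompts` mapping a Style LoRA name to its style prompt and `set_adapters(denoiser, style=name)` selecting that LoRA are again assumed helpers.

```python
import torch

@torch.no_grad()
def blended_eps(denoiser, x_t, c, style_prompts, style_scales, t, cfg_scale=5.0):
    """Noise prediction of Eq. 11: one accumulated guidance term per learned style (sketch)."""
    set_adapters(denoiser, style=None)
    eps_uncond = denoiser(x_t, None, t)
    eps_text = denoiser(x_t, c, t)
    eps = eps_uncond + cfg_scale * (eps_text - eps_uncond)
    for (name, c_style), scale in zip(style_prompts.items(), style_scales):
        set_adapters(denoiser, style=name)          # activate the i-th Style LoRA
        eps = eps + scale * (denoiser(x_t, c_style, t) - eps_text)
    return eps
```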
4 Experiments
In this section, we show our method's results on various image pairs and compare
them with several baselines. We explain our dataset, baselines, and metrics in
detail, then we present quantitative and qualitative results.
4.1 Dataset
Datasets. To enable large-scale quantitative evaluation, we construct a diverse
set of paired style and content images as follows. First, we generate 40 content im-
ages for each class: headshots, animals, and landscapes. When generating images
in the headshot class, we generate 20 images with the prompt “A professional
headshot of a man” and 20 images with the prompt “A professional headshot
of a woman”. Similarly, we split the animal class into photos of dogs and cats. To
curate synthetic pairs, we then apply image editing or image-to-image transla-
tion methods to all the content images to obtain the stylized version. For each
unique prompt, we choose a single paired instance as training data and hold out
the other pairs with the same prompt as a test set (Same Category). For each
prompt, we also choose 5 pairs from each of the other prompts as a secondary
test set (Different Category). We show all our synthetic training image pairs
in Appendix C. By leveraging synthetic pairs for evaluation, we can train on a
single synthetic pair and test our results against held out synthetic style images.
Second, we also qualitatively compare on artist-created image pairs in Figure 6. Next,
we describe the specific methods to create the paired dataset.
LEDITS++ [7] is a diffusion-based image editing technique that transforms
an image by updating the inference path of a diffusion model.
Baselines. We compare our method against: (1) DreamBooth LoRA [33, 74]
(DB LoRA), (2) Concept Sliders [22], (3) IP-Adapter [95], (4) IP-Adapter w/
T2I, and (5) StyleDrop [82]. DB LoRA uses only the style image and fine-tunes
low-rank adapters in all the linear layers in the attention blocks of the diffusion
model. We evaluate different amounts of style applications for DB LoRA using
the standard LoRA scale [74] and our style guidance. Concept Sliders presents a
paired image model customization method that trains a single low-rank adapter
jointly on both images, with different reconstruction losses for the style and
content images. We also evaluate using both the standard LoRA scale and our
style guidance. IP-Adapter is an encoder-based method that does not require
training for every style and takes a style image as an extra condition separate
from the text prompt. Increasing or decreasing the guidance from the input style
image is possible by scaling the weight of the image conditioning. We consider the
SDXL [65] implementation of this method. For the IP-Adapter, we also compare
against the stronger baseline of providing extra conditioning of an edge map
of the content image through T2I Adapters [56] to preserve the content image
structure. The recently proposed StyleDrop [82] technique for learning new styles
is based on MUSE [11] and uses human feedback in its method. Since MUSE
is not publicly available, we follow the setup of Style-Aligned Image Generation [28]
and implement a version of StyleDrop on SDXL. Specifically, we train low-rank
linear layers following each feed-forward layer in the attention blocks of SDXL.
For a fair comparison, we train StyleDrop without human feedback.
Evaluation metrics. When evaluating the performance of each method, we
consider two quantitative metrics: perceptual distance to ground truth style im-
ages and structure preservation from the original image. A better customization
Fig. 5: Human preference study. Our method is preferred over the baselines (≥
60%). Further, our full method, including orthogonal weight matrices (Section 3.2),
is preferred over the one w/o orthogonal weight matrices, specifically for the same
category as the training pair, e.g., trained on a headshot of a man and tested on other
headshots of men. The gray dashed line denotes 50% chance performance.
method will have a low perceptual distance to the ground truth style images
while still preserving the content of the original image before adding style. We
measure these using: (1) Distance to GT Styled: given held-out ground truth style
images, we measure the perceptual distance between our styled outputs and the
ground truth style images using DreamSim [17], a recent method for measur-
ing the perceptual distance between images. DreamSim image embeddings consist
of an ensemble of image embedding models, including CLIP [66] and
DINO [10], which are then fine-tuned so the final embeddings respect human
perception. We measure DreamSim distance as (1 - cosine similarity) between
DreamSim embeddings, where a lower value implies that the images are percep-
tually more similar. (2) Distance to Content Image: to measure content preserva-
tion after style application, we measure the perceptual distance of our generated
style image to the original content image with no style guidance. We again use
DreamSim, this time comparing styled and content images. Note here that a
perceptual distance of zero to the content image is undesirable, as this would
require no style to be applied. However, a better-performing method should ob-
tain a better tradeoff between the two distances. (3) We also perform a human
preference study of our method against baselines.
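As an illustration of how the two distances could be computed, the sketch below uses a generic `embed` function standing in for a DreamSim-style perceptual embedder; it is not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def perceptual_distance(embed, img_a, img_b):
    """1 - cosine similarity between perceptual embeddings (DreamSim-style)."""
    ea, eb = embed(img_a), embed(img_b)
    return 1.0 - F.cosine_similarity(ea, eb, dim=-1)

def evaluate(embed, styled, gt_style_images, content):
    """Metric (1): distance to held-out GT style images.
    Metric (2): distance to the unstyled content image."""
    d_style = torch.stack(
        [perceptual_distance(embed, styled, g) for g in gt_style_images]).mean()
    d_content = perceptual_distance(embed, styled, content).mean()
    return d_style.item(), d_content.item()
```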
4.3 Results
Fig. 6: Result of our method compared to the strongest baselines. When only training
with the style image as in DB LoRA, the image structure is not preserved and over-
fitting occurs. While Concept Slider’s training scheme [22] uses both style and content
images, it still exhibits overfitting and loss of structure in many cases. Our method
preserves the structure of the input image while faithfully applying the desired style.
We use style guidance strength 3 and classifier guidance strength 5. Style image credits:
Jack Parkhouse (third row) and Aaron Hertzmann (fourth row).
Fig. 8: Limitations. Left: our method struggles with categories when they signifi-
cantly differ from the training categories. Here, our method fails to transfer the artistic
style of landscape image pairs to human portraits. Right: our method can cause
structure changes in some instances, such as changes in body position or background.
Our method learns the style concept without memorizing the content of input examples. While our
approach outperforms existing customization methods, it still exhibits several
limitations, as discussed below.
Limitations. First, while our method is able to transfer the style from a pair
of dog images to cat photos, it struggles with test categories that differ significantly
from the training image pair. As shown in Figure 8 (left), our method falls short of
faithfully replicating the style of landscape paintings on some human images.
Second, our current method relies on test-time optimization, which takes
around 15 minutes on a single A5000 GPU. This can be computationally de-
manding if we need to process many image styles. Leveraging encoder-based
approaches [2, 72] for predicting style and content weights in a feed-forward
manner could potentially speed up the customization process.
Finally, our method may occasionally fail to completely maintain input struc-
ture, as demonstrated in Figure 8 (right).
Acknowledgments. We would like to thank Ali Jahanian, Gaurav Parmar,
Ruihan Gao, Sean Liu, and Or Patashnik for their insightful feedback and in-
put that contributed to the finished work. We also thank Jack Parkhouse and
Aaron Hertzmann for providing style images. Maxwell Jones is supported by
the Siebel Scholar fellowship. This project is partly supported by the Amazon
Faculty Research Award, NSF IIS-2239076, Open Philanthropy, and the Packard
Fellowship.
References
1. Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time repre-
sentation for text-to-image personalization. In: SIGGRAPH Asia (2023) 3
2. Arar, M., Gal, R., Atzmon, Y., Chechik, G., Cohen-Or, D., Shamir, A.,
H. Bermano, A.: Domain-agnostic tuning-encoder for fast personalization of text-
to-image models. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–10 (2023)
4, 14
3. Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a-
scene: Extracting multiple concepts from a single image. In: SIGGRAPH Asia
2023 Conference Papers. pp. 1–12 (2023) 4
4. Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila,
T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an
ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022) 3
5. Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., Efros, A.A.: Visual prompt-
ing via image inpainting. In: Adv. Neural Inform. Process. Syst. (2022) 4
6. Bau, D., Liu, S., Wang, T., Zhu, J.Y., Torralba, A.: Rewriting a deep generative
model. In: Eur. Conf. Comput. Vis. (2020) 3
7. Brack, M., Friedrich, F., Kornmeier, K., Tsaban, L., Schramowski, P., Kersting,
K., Passos, A.: Ledits++: Limitless image editing using text-to-image models.
arXiv preprint arXiv:2311.16711 (2023) 9
8. Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image
editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 18392–18402 (2023) 3, 8, 23
9. Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text
pair dataset. https://github.com/kakaobrain/coyo-dataset (2022) 3
10. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin,
A.: Emerging properties in self-supervised vision transformers. In: Int. Conf. Com-
put. Vis. (2021) 11
11. Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang,
M.H., Murphy, K., Freeman, W.T., Rubinstein, M., et al.: Muse: Text-to-image
generation via masked generative transformers. In: Proceedings of the 40th Inter-
national Conference on Machine Learning. pp. 4055–4075 (2023) 3, 10
12. Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. arXiv
preprint arXiv:1612.04337 (2016) 4
13. Chen, W., Hu, H., Li, Y., Rui, N., Jia, X., Chang, M.W., Cohen, W.W.: Subject-
driven text-to-image generation via apprenticeship learning. In: Adv. Neural In-
form. Process. Syst. (2023) 4
14. Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H.: Anydoor: Zero-shot
object-level image customization. arXiv preprint arXiv:2307.09481 (2023) 4
15. Daras, G., Dimakis, A.G.: Multiresolution textual inversion. arXiv preprint
arXiv:2211.17115 (2022) 3
16. Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real nvp. In:
Int. Conf. Learn. Represent. (2017) 3
17. Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dream-
sim: Learning new dimensions of human visual similarity using synthetic data.
arXiv preprint arXiv:2306.09344 (2023) 11, 21
18. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G.,
Cohen-or, D.: An image is worth one word: Personalizing text-to-image generation
using textual inversion. In: The Eleventh International Conference on Learning
Representations (2022) 1, 3
19. Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: De-
signing an encoder for fast personalization of text-to-image models. ACM Trans-
actions on Graphics (TOG) (2023) 3
20. Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.:
Encoder-based domain tuning for fast personalization of text-to-image models.
ACM Transactions on Graphics (TOG) 42(4), 1–13 (2023) 4
21. Gal, R., Patashnik, O., Maron, H., Bermano, A.H., Chechik, G., Cohen-Or, D.:
Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Trans-
actions on Graphics (TOG) 41(4), 1–13 (2022) 3
22. Gandikota, R., Materzynska, J., Zhou, T., Torralba, A., Bau, D.: Concept
sliders: Lora adaptors for precise control in diffusion models. arXiv preprint
arXiv:2311.12092 (2023) 10, 12, 13
23. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv
preprint arXiv:1508.06576 (2015) 4
24. Gokaslan, A., Cooper, A.F., Collins, J., Seguin, L., Jacobson, A., Patel, M., Fran-
kle, J., Stephenson, C., Kuleshov, V.: Commoncanvas: An open diffusion model
trained with creative-commons images. arXiv preprint arXiv:2310.16825 (2023) 3
25. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications
of the ACM 63(11), 139–144 (2020) 3
26. Gu, Y., Wang, X., Wu, J.Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R.,
Chang, S., Wu, W., et al.: Mix-of-show: Decentralized low-rank adaptation for
multi-concept customization of diffusion models. Advances in Neural Information
Processing Systems 36 (2024) 4
27. Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., Yang, F.: Svdiff: Compact
parameter space for diffusion fine-tuning. In: Int. Conf. Comput. Vis. (2023) 3
28. Hertz, A., Voynov, A., Fruchter, S., Cohen-Or, D.: Style aligned image generation
via shared attention. arXiv preprint arXiv:2312.02133 (2023) 4, 10, 21, 22, 23
29. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analo-
gies (2001) 4
30. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv.
Neural Inform. Process. Syst. (2020) 3, 4, 5
31. Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint
arXiv:2207.12598 (2022) 3, 7, 26
32. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Ges-
mundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp.
In: International Conference on Machine Learning. pp. 2790–2799. PMLR (2019)
5
33. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen,
W.: Lora: Low-rank adaptation of large language models. In: Int. Conf. Learn.
Represent. (2021) 2, 3, 5, 10, 13
34. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive in-
stance normalization. In: Int. Conf. Comput. Vis. (2017) 4
35. Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R.,
Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A.,
Schmidt, L.: Openclip (Jul 2021). https://doi.org/10.5281/zenodo.5143773 5
36. Irony, R., Cohen-Or, D., Lischinski, D.: Colorization by example. In: Eurographics
Conference on Rendering Techniques (2005) 4
37. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with condi-
tional adversarial networks. In: IEEE Conf. Comput. Vis. Pattern Recog. (2017)
3
38. Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.:
Scaling up gans for text-to-image synthesis. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2023) 3
39. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., Aila, T.: Training gen-
erative adversarial networks with limited data. In: Adv. Neural Inform. Process.
Syst. (2020) 3
40. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative
adversarial networks. In: IEEE Conf. Comput. Vis. Pattern Recog. (2019) 3
41. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of stylegan. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2020) 3
42. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Int. Conf. Learn.
Represent. (2014) 3
43. Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept cus-
tomization of text-to-image diffusion. In: 2023 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR). pp. 1931–1941. IEEE Computer
Society (2023) 1, 3, 4
44. Li, D., Li, J., Hoi, S.: Blip-diffusion: Pre-trained subject representation for con-
trollable text-to-image generation and editing. Advances in Neural Information
Processing Systems 36 (2024) 4
45. Li, Y., Wang, N., Liu, J., Hou, X.: Demystifying neural style transfer. arXiv
preprint arXiv:1701.01036 (2017) 4
46. Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: Gligen: Open-
set grounded text-to-image generation. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2023) 3
47. Liao, J., Yao, Y., Yuan, L., Hua, G., Kang, S.B.: Visual attribute transfer through
deep image analogy. ACM Trans. Graph. 36(4) (jul 2017) 4
48. Liu, L., Ren, Y., Lin, Z., Zhao, Z.: Pseudo numerical methods for diffusion models
on manifolds. In: Int. Conf. Learn. Represent. (2022) 8
49. Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual gen-
eration with composable diffusion models. In: European Conference on Computer
Vision. pp. 423–439. Springer (2022) 8
50. Liu, Z., Feng, R., Zhu, K., Zhang, Y., Zheng, K., Liu, Y., Zhao, D., Zhou, J.,
Cao, Y.: Cones: Concept neurons in diffusion models for customized generation.
In: Int. Conf. Mach. Learn. (2023) 3
51. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International
Conference on Learning Representations (2018) 8
52. Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models:
Synthesizing high-resolution images with few-step inference. arXiv preprint
arXiv:2310.04378 (2023) 3
53. Ma, J., Liang, J., Chen, C., Lu, H.: Subject-diffusion: Open domain person-
alized text-to-image generation without test-time fine-tuning. arXiv preprint
arXiv:2307.11410 (2023) 4
54. Markus, C.: How six different artists have re-interpreted da vinci’s ‘mona lisa’.
https://www.parkwestgallery.com/six-different-artists-da-vinci-mona-
lisa/ (11 2019) 1
55. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided
image synthesis and editing with stochastic differential equations. In: Int. Conf.
Learn. Represent. (2022) 8
56. Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter:
Learning adapters to dig out more controllable ability for text-to-image diffusion
models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38,
pp. 4296–4304 (2024) 3, 10
57. Nitzan, Y., Aberman, K., He, Q., Liba, O., Yarom, M., Gandelsman, Y., Mosseri,
I., Pritch, Y., Cohen-Or, D.: Mystyle: A personalized generative prior. In: SIG-
GRAPH ASIA (2022) 3
58. Nitzan, Y., Gharbi, M., Zhang, R., Park, T., Zhu, J.Y., Cohen-Or, D., Shechtman,
E.: Domain expansion of image generators. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2023) 3
59. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.:
Conditional image generation with pixelcnn decoders. In: Adv. Neural Inform.
Process. Syst. (2016) 3
60. Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired
image-to-image translation. In: Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. pp.
319–345. Springer (2020) 4
61. Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with
spatially-adaptive normalization. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2019) 3
62. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Int. Conf.
Comput. Vis. (2023) 3
63. Phillips Collection, T.: Van Gogh repetitions. https://www.phillipscollection.
org/event/2013-10-11-van-gogh-repetitions (10 2013) 1
64. Po, R., Yang, G., Aberman, K., Wetzstein, G.: Orthogonal adaptation for modular
customization of diffusion models. arXiv preprint arXiv:2312.02432 (2023) 3, 4, 7
65. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna,
J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image
synthesis. arXiv preprint arXiv:2307.01952 (2023) 3, 5, 10
66. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry,
G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models
from natural language supervision. In: Int. Conf. Mach. Learn. (2021) 5, 11
67. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
(2022) 1, 3
68. Reed, S., Zhang, Y., Zhang, Y., Lee, H.: Deep visual analogy-making. In: Adv.
Neural Inform. Process. Syst. (2015) 4
69. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: IEEE Conf. Comput. Vis. Pattern
Recog. (2022) 1, 3, 5
70. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: International Conference on Medical image comput-
ing and computer-assisted intervention. pp. 234–241. Springer (2015) 5
71. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream-
booth: Fine tuning text-to-image diffusion models for subject-driven generation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 22500–22510 (2023) 1, 3
72. Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubin-
stein, M., Aberman, K.: Hyperdreambooth: Hypernetworks for fast personaliza-
tion of text-to-image models (2023) 4, 14
73. Ryu, S.: Lora-stable diffusion. https://github.com/cloneofsimo/lora (2023) 2,
3
74. Ryu, S.: Low-rank adaptation for fast text-to-image diffusion fine-tuning. https:
//github.com/cloneofsimo/lora (2023) 3, 5, 7, 10, 13
75. Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., Norouzi,
M.: Palette: Image-to-image diffusion models. In: ACM SIGGRAPH 2022 Con-
ference Proceedings. pp. 1–10 (2022) 3
76. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour,
S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al.: Photorealistic text-to-
image diffusion models with deep language understanding. In: NeurIPS (2022)
3
77. Sauer, A., Karras, T., Laine, S., Geiger, A., Aila, T.: Stylegan-t: Unlocking the
power of gans for fast large-scale text-to-image synthesis. In: International con-
ference on machine learning. pp. 30105–30118. PMLR (2023) 3
78. Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta,
A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-
filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021) 3
79. Shah, V., Ruiz, N., Cole, F., Lu, E., Lazebnik, S., Li, Y., Jampani, V.: Ziplora: Any
subject in any style by effectively merging loras. arXiv preprint arXiv:2311.13600
(2023) 4
80. Shi, J., Xiong, W., Lin, Z., Jung, H.J.: Instantbooth: Personalized text-to-image
generation without test-time finetuning. arXiv preprint arXiv:2304.03411 (2023)
4
81. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsuper-
vised learning using nonequilibrium thermodynamics. In: Int. Conf. Mach. Learn.
(2015) 4
82. Sohn, K., Ruiz, N., Lee, K., Chin, D.C., Blok, I., Chang, H., Barber, J., Jiang, L.,
Entis, G., Li, Y., et al.: Styledrop: Text-to-image generation in any style. arXiv
preprint arXiv:2306.00983 (2023) 3, 4, 8, 10
83. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf.
Learn. Represent. (2021) 3
84. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.:
Score-based generative modeling through stochastic differential equations. In:
ICLR (2021) 3, 4
85. Tenenbaum, J., Freeman, W.: Separating style and content. Advances in neural
information processing systems 9 (1996) 4
86. Tewel, Y., Gal, R., Chechik, G., Atzmon, Y.: Key-locked rank one editing for
text-to-image personalization. ACM Transactions on Graphics (TOG) (2023) 3
87. Upchurch, P., Snavely, N., Bala, K.: From a to z: supervised transfer of style and
content using deep neural network generators. arXiv preprint arXiv:1603.02003
(2016) 4
88. Valevski, D., Lumen, D., Matias, Y., Leviathan, Y.: Face0: Instantaneously con-
ditioning a text-to-image model on a face. In: SIGGRAPH Asia 2023 Conference
Papers. pp. 1–10 (2023) 4
89. Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K.: p`: Extended textual condi-
tioning in text-to-image generation. arXiv preprint arXiv:2303.09522 (2023) 3
90. Wang, S.Y., Bau, D., Zhu, J.Y.: Sketch your own gan. In: Int. Conf. Comput. Vis.
(2021) 3
91. Wang, S.Y., Bau, D., Zhu, J.Y.: Rewriting geometric rules of a gan. ACM SIG-
GRAPH (2022) 3
92. Wang, X., Wang, W., Cao, Y., Shen, C., Huang, T.: Images speak in images:
A generalist painter for in-context visual learning. In: IEEE Conf. Comput. Vis.
Pattern Recog. (2023) 4
93. Wang, X., Yu, J.: Learning to cartoonize using white-box cartoon representations.
In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. pp. 8090–8099 (2020) 10
94. Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: Encoding visual
concepts into textual embeddings for customized text-to-image generation. In:
Int. Conf. Comput. Vis. (2023) 4
95. Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compati-
ble image prompt adapter for text-to-image diffusion models. arXiv preprint
arXiv:2308.06721 (2023) 4, 10, 13
96. Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku,
A., Yang, Y., Ayan, B.K., et al.: Scaling autoregressive models for content-rich
text-to-image generation. Transactions on Machine Learning Research (2022) 3
97. Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion
models. In: Int. Conf. Comput. Vis. (2023) 3, 21
98. Zhao, S., Liu, Z., Lin, J., Zhu, J.Y., Han, S.: Differentiable augmentation for
data-efficient gan training. In: Adv. Neural Inform. Process. Syst. (2020) 3
99. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation
using cycle-consistent adversarial networks. In: Int. Conf. Comput. Vis. (2017) 4
100. Zou, Z., Shi, T., Qiu, S., Yuan, Y., Shi, Z.: Stylized neural painting. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 15689–15698 (2021) 10
Appendix
In Section A, we evaluate our method against baselines on a diversity metric,
showing that our method leads to comparatively more diverse generations.
We also show more qualitative results along with a comparison to the concur-
rent work of Style Aligned Image Generation [28]. In Section B, we then present
details of our style guidance formulation. Finally, in Section C, we provide more
implementation details, including the setup for our human preference study and
the full synthetic training dataset used for evaluation.
A Diversity Evaluation
To quantify overfitting to the style training image, we measure the average pairwise perceptual distance among images generated by each customized model for a given prompt:
\label{eq:overfitting} \begin{aligned} \text{DreamSim Diversity} = \mathbb{E}_{S \in \mathcal{S}, P \in \mathcal{P}}\left[\mathbb{E}_{i_1, i_2 \in \text{data}_{S, P}}\text{DreamSim}(i_1, i_2)\right] \end{aligned} (12)
where S is the set of style image pairs, P is the set of prompts, and dataS,P
is the set of images generated with prompt P by a model customized on style
S. DreamSim(·, ·) is the DreamSim perceptual distance. A decrease in DreamSim
Diversity indicates that all images in a certain domain are becoming perceptually
similar, which may indicate overfitting to the style training image. Methods
that do not overfit the style training image should have higher diversity scores
while also having a low perceptual distance to the ground truth testing style
images. We present our findings in Figure 9. Our method is able to achieve a
low perceptual distance to style ground-truth images while maintaining higher
diversity scores. As shown in Figure 6 in the main paper, the baseline results
mode-collapse to the training image, thus lowering their diversity scores as they
all become perceptually similar to each other.
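Reusing the `perceptual_distance` helper from the earlier metric sketch, the diversity score of Equation 12 for one (style, prompt) set could be computed as follows; this is an illustrative sketch, not the authors' code.

```python
from itertools import combinations
import torch

def dreamsim_diversity(embed, images):
    """Average pairwise perceptual distance within one (style, prompt) set (Eq. 12)."""
    dists = [perceptual_distance(embed, a, b) for a, b in combinations(images, 2)]
    return torch.stack(dists).mean().item()
```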
Style Aligned Image Generation [28] baseline. This is a recent work for
zero-shot style-consistent image generation from an exemplar style image. Given
the exemplar style image, it is first inverted to a noise map; then for a new text
prompt, the image is generated by attending to both its own self-attention map
and the self-attention map from the style exemplar at every denoising step. We
compare against this baseline by using the style image in our training image
pair as an exemplar and generating a new style image with a new text prompt
using this method. Optionally, we condition this generation on the edge map of
the newly generated image without attention sharing using ControlNet [97] to
help with content preservation. We show the qualitative results of our method
compared to all the variants of this baseline in Figure 12. Figures 10 and 11
show quantitative comparison, where our method outperforms this baseline in
terms of both style similarity and diversity metric. We achieve lower perceptual
Fig. 10: Style similarity with Style Aligned [28]. Our method Pareto dominates
both versions of Style Aligned Image Generation for image generation both on the same
category as training (left) and when evaluated on categories different from training,
e.g., trained on human portraits but tested on dog images (right).
distance to the style ground-truth images, low perceptual distance from content
images, and high diversity.
Extra qualitative evaluation. We compare our method to non-finetuning-based
methods in Figure 12. We observe that these methods perform worse than
finetuning-based methods, especially when generating images in a different
category from the training style image. Second, we compare our method with
the highest-performing baselines, but use our style guidance (Equation 10) to
apply stylization during inference for these baselines. We present our results in
Figure 13. We notice that using style guidance for adding style allows the
baseline methods to better preserve the original content compared to LoRA scale
(Figure 13 vs. Figure 6). While our style guidance helps the baselines preserve
content while applying style, our full method still outperforms these augmented baselines.
Fig. 11: Image diversity with Style Aligned [28] on learned style (Diversity).
Our method has high diversity and low perceptual distance to ground truth style images
both on the same category as training (left) and when evaluated on categories different
from training, e.g., trained on human portraits but tested on dog images (right) as
compared to both versions of Style Aligned Image Generation.
B Style Guidance Details
In this section, we derive our style guidance formulation. We consider the prob-
ability of latent x with multiple conditionings [8], i.e., the text prompt ct and a
class of style images cstyle . First, we apply Bayes’ rule:
\begin{aligned} P(x|c_t, c_\text{style}) = \frac{P(x, c_t, c_\text{style})}{P(c_t, c_\text{style})} = \frac{P(c_\text{style}|c_t, x)\,P(c_t|x)\,P(x)}{P(c_t, c_\text{style})} \end{aligned} (13)
Taking the logarithm of both sides gives
\begin{aligned} \log P(x|c_t, c_\text{style}) = \log P(c_\text{style}|c_t, x) + \log P(c_t|x) + \log P(x) - \log P(c_t, c_\text{style}). \end{aligned}
(14)
Taking the gradient with respect to x (the normalization term log P(c_t, c_style) does not depend on x), we obtain
\label{eq:logmath} \begin{aligned} \nabla_x \log P(x|c_t, c_\text{style}) &= \nabla_x \log P(c_\text{style}|c_t, x) + \nabla_x \log P(c_t|x) + \nabla_x \log P(x) \\ &= \nabla_x \log\!\left(\frac{P(c_\text{style}, c_t, x)}{P(c_t, x)}\right) + \nabla_x \log\!\left(\frac{P(c_t, x)}{P(x)}\right) + \nabla_x \log P(x) \\ &= \big(\nabla_x \log P(c_\text{style}, c_t, x) - \nabla_x \log P(c_t, x)\big) \\ &\quad + \big(\nabla_x \log P(c_t, x) - \nabla_x \log P(x)\big) + \nabla_x \log P(x). \end{aligned}
(15)
As usual, we approximate ∇x log P(ct, x) via ϵθ(xt, ct) and ∇x log P(x) via
ϵθ(xt, ∅). Importantly, we approximate
\label{eq:updaterulemath} \begin{aligned} \nabla_x \log P(x_t|c_t, c_\text{style}) \approx \epsilon_{\theta_{\text{style}}}(x_t, c_{t, \text{style}}) \end{aligned} (16)
Fig. 12: Result of our method compared to the methods without finetuning (zoom in
for best viewing). For all methods, we consider adding the edgemap from the pretrained
output as an extra conditioning using ControlNet. Without this edgemap, other meth-
ods tend to lose the structure of the pretrained output. In some cases, however, an
additional edgemap can overly constrain the output of a model, like in the second and
fourth stylistic image pairs. Our method preserves the structure of the Stable Diffusion
image, while faithfully applying the desired style. We use style guidance strength 3
and classifier guidance strength 5 for our method and set the IP-adapter scale and
style-alignment scale to 0.5.
where ct is the original text prompt, cstyle is the class of stylized images from
the training style, θstyle is the UNet with style LoRA adapters applied, and
ct,style = "{ct} in <desc> style". Here, we use ct to push the prediction in the
text direction, and both text conditioning ("in <desc> style") and low-rank
adapters (θstyle ) to push the prediction into the class of images in the artist’s
Fig. 13: Result of our method compared to the strongest baselines, but replacing
LoRA scale (Eq. 3) with our style guidance (Eq. 10) for the baselines. While our style
guidance increases baseline performance over LoRA scale images displayed in Figure
6, our method is still superior in terms of preserving content while applying style.
style denoted by cstyle . Following this, our new score estimate is:
\begin{aligned} \hat{\epsilon}_\theta(x_t, c_t, c_\text{style}) = \; & \epsilon_\theta(x_t, \varnothing) \\ +\ & \lambda_{\text{cfg}} \big(\epsilon_\theta(x_t, c_t) - \epsilon_\theta(x_t, \varnothing)\big) \\ +\ & \lambda_{\text{style}}\big(\epsilon_{\theta_{\text{style}}}(x_t, c_{t, \text{style}}) - \epsilon_{\theta}(x_t, c_t)\big) \end{aligned}
(18)
Fig. 14: Training Data. We present the synthetic training data set used for evalua-
tion, where each pair is used as a single training instance. Each column corresponds to
a different style, and each row corresponds to a different content category.
λcfg and λstyle are guidance scales that can be varied as in classifier-free guidance [31].
Given a fixed λcfg, we can vary λstyle to generate images at the original guidance
scale λcfg with varying amounts of style. Notice that at λcfg = λstyle, the ϵθ(xt, ct)
terms cancel and we recover the standard classifier-free guidance formulation, now
applied with the style model and style prompt.
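For completeness, substituting λcfg = λstyle = λ into Equation 18 makes the cancellation explicit:
\begin{aligned} \hat{\epsilon}_\theta(x_t, c_t, c_\text{style}) &= \epsilon_\theta(x_t, \varnothing) + \lambda\big(\epsilon_\theta(x_t, c_t) - \epsilon_\theta(x_t, \varnothing)\big) + \lambda\big(\epsilon_{\theta_\text{style}}(x_t, c_{t,\text{style}}) - \epsilon_\theta(x_t, c_t)\big) \\ &= \epsilon_\theta(x_t, \varnothing) + \lambda\big(\epsilon_{\theta_\text{style}}(x_t, c_{t,\text{style}}) - \epsilon_\theta(x_t, \varnothing)\big), \end{aligned}
i.e., Equation 9 evaluated with the customized style model and the style prompt.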
C Implementation Details
Training data. We present our full training set of 20 different style transfor-
mations in Figure 14. Each image pair is a standalone training instance used in
our method. We consider four different styles (posterization, impressionist, neu-
ral painting, cartoonization), with each column corresponding to a single style.
For each style, we consider five categories for training (man, woman, dog, cat,
landscape).
Mechanical Turk details. When running Amazon Mechanical Turk, we
prompt users with an analogy-style interface. First, we provide the training pair
of images, followed by the testing content image, and two options for possible
styled examples. After viewing both images, users choose either the left or right
image. Figure 15 shows an example. Each individual user is presented with four
training examples, as in Figure 15, followed by 16 random testing examples
comparing our method with one of our baselines. We survey 75 users for each
of the 16 individual studies and use bootstrapping to obtain variance estimates.
In total, we collect 19,200 user responses. For each method, we pick a stylization
hyperparameter based on Figure 4; see Table 1 for details.