Plug-and-Play Diffusion Features
[Figure 1 panels: input real image; "a photo of a bronze horse in a museum"; "a photo of a pink horse on the beach"; "a photo of a robot horse"; "a polygonal illustration of a cat and a bunny"; "a photo of bear cubs in the snow".]
Figure 1. Given a single real-world image as input, our framework enables versatile text-guided translations of the original content. Our
results exhibit high fidelity to the input structure and scene layout, while significantly changing the perceived semantic meaning of objects
and their appearance. Our method does not require any training, but rather harnesses the power of a pre-trained text-to-image diffusion
model through its internal representation. We present new insights about deep features encoded in such models, and an effective framework
to control the generation process through simple modification of these features.
…erated layout is to design text-to-image foundation models that explicitly incorporate additional guiding signals, such as user-provided masks [12, 28, 34]. For example, Make-A-Scene [12] recently trained a text-to-image model that is also conditioned on a label segmentation mask, defining the layout and the categories of objects in the scene. However, such an approach requires extensive compute as well as large-scale text-guidance-image training tuples, and can be applied at test time only to these specific types of inputs. In this paper, we are interested in a unified framework that can be applied to versatile I2I translation tasks, where the structure guidance signal ranges from artistic drawings to photo-realistic images (see Fig. 1). Our method does not require any training or fine-tuning, but rather leverages a pre-trained and fixed text-to-image diffusion model [36].

Specifically, we pose the fundamental question of how structure information is internally encoded in such a model. We dive into the intermediate spatial features that are formed during the generation process, empirically analyze them, and devise a new framework that enables fine-grained control over the generated structure by applying simple manipulations to spatial features inside the model. In particular, spatial features and their self-attentions are extracted from the guidance image and are directly injected into the text-guided generation process of the target image. We demonstrate that our approach is applicable not only when the guidance image is generated from text, but also for real-world images that are inverted into the model.

Our approach of operating in the space of diffusion features is related to Prompt-to-Prompt (P2P) [16], which recently observed that by manipulating the cross-attention layers, it is possible to control the relation between the spatial layout of the image and each word in the text. We demonstrate that fine-grained control over the generated layout is difficult to achieve solely from the interaction with the text. Intuitively, since the cross-attention is formed by the association of spatial features with words, it captures only rough regions at the object level, and localized spatial information that is not expressed in the source text prompt (e.g., object parts) is not guaranteed to be preserved by P2P. Instead, our method focuses only on spatial features and their self-affinities – we show that such features exhibit high granularity of spatial information, allowing us to control the generated structure while not restricting the interaction with the text. Our method outperforms P2P in terms of structure preservation and is superior in working with real guidance images.

To summarize, we make the following key contributions:
(i) We provide new empirical insights about internal spatial features formed during the diffusion process.
(ii) We introduce an effective framework that leverages the power of a pre-trained and fixed guided diffusion model, allowing us to perform high-quality text-guided I2I translation without any training or fine-tuning.
(iii) We show, both quantitatively and qualitatively, that our method outperforms existing state-of-the-art baselines, achieving a significantly better balance between preserving the guidance layout and deviating from its appearance.

2. Related Work

Image-to-image translation. Image-to-Image (I2I) translation aims at estimating a mapping of an image from a source domain to a target domain, while preserving the domain-invariant characteristics of the input image, e.g., objects' structure or scene layout. From classical to modern data-driven methods, numerous visual problems have been formulated and tackled as an I2I task (e.g., [7, 10, 17, 32, 42]). Seminal deep-learning-based methods have proposed various GAN-based frameworks to encourage the output image to comply with the distribution of the target domain [23, 29, 30, 49]. Nevertheless, these methods require datasets of example images from both source and target domains, and often require training from scratch for each translation task (e.g., horse-to-zebra, day-to-night, summer-to-winter). Other works utilize pre-trained GANs by performing the translation in their latent space [1, 35, 45]. Several methods have also considered the task of zero-shot I2I by training a generator on a single source-target image pair example [46, 47]. With the advent of unconditional image diffusion models, several methods have been proposed to adopt or extend them for various I2I tasks [39, 48]. In this paper, we consider the task of text-guided image-to-image translation, where the target domain is specified not through a dataset of images but rather via a target text prompt. Our method is zero-shot, does not require training, and is applicable to versatile I2I tasks.

Text-guided image manipulation. With the tremendous progress in language-vision models, a surge of methods has been proposed to perform various types of text-driven image edits. Various methods have proposed to combine CLIP [33], which provides a rich and powerful joint image-text embedding space, with a pre-trained unconditional image generator, e.g., a GAN [6, 14, 26, 31] or a diffusion model [2, 22, 25]. For example, DiffusionCLIP [22] uses CLIP to fine-tune a diffusion model to perform text-guided manipulations. Concurrent to our work, [25] uses CLIP and the semantic losses of [46] to guide a diffusion process to perform I2I translation. Aiming to edit the appearance of objects in real-world images, Text2LIVE [4] trains a generator on a single image-text pair, without additional training data, thus avoiding the trade-off, inherent to pre-trained generators, between satisfying the target edit and maintaining high fidelity to the original content. While these methods have demonstrated impressive text-guided semantic edits, there is still a gap between a generative prior that is learned solely from visual data (typically on specific domains or ImageNet data) and the rich CLIP text-image guiding signal that has been learned from much broader and richer data.
“A photo of a
statue in the snow”
DDIM
Inversion
Original denoised
Input image Feature and
self-attention
Injection
Residual Self Cross
Block Attention Attention
“A photo of a statue
in the snow” Injected features by our method
Figure 2. Plug-and-play Diffusion Features. (a) Our framework takes as input a guidance image and a text prompt describing the desired translation; the guidance image is inverted to initial noise $x_T^G$, which is then progressively denoised using DDIM sampling. During this process, we extract $(f_t^l, q_t^l, k_t^l)$ – spatial features from the decoder layers and their self-attention, as illustrated in (b). To generate our text-guided translated image, we fix $x_T^* = x_T^G$ and inject the guidance features $(f_t^l, q_t^l, k_t^l)$ at certain layers, as discussed in Sec. 4.
Recently, text-to-image generative models have closed this gap by directly conditioning image generation on text during training [12, 28, 34, 36, 40]. These models have demonstrated unprecedented capabilities in generating high-quality and diverse images from text, capturing complex visual concepts (e.g., object interactions, geometry, or composition). Nevertheless, such models offer little control over the generated content. This creates great interest in developing methods to adopt such unconstrained text-to-image models for controlled content creation.

Several concurrent methods have taken first steps in this direction, aiming to influence different properties of the generated content [13, 21, 38, 48]. DreamBooth [38] and Textual Inversion [13] share the same high-level goal of "personalizing" a pre-trained text-to-image diffusion model given a few user-provided images. Our method also leverages a pre-trained text-to-image diffusion model to achieve our goal, yet it does not involve any training or fine-tuning. Instead, we devise a simple framework that intervenes in the generation process by directly manipulating the spatial features.

As discussed in Sec. 1, our methodological approach is related to Prompt-to-Prompt [16], yet our method offers several key advantages: (i) it enables fine-grained control over the generated shape and layout; (ii) it allows using arbitrary text prompts to express the target translation, in contrast to P2P, which requires word-to-word alignment between the source and target text prompts; and (iii) it demonstrates superior performance on real-world guidance images.

Lastly, SDEdit [27] is another method that applies edits on user-provided images using free text prompts. Their method noises the guidance image to an intermediate diffusion step, and then denoises it conditioned on the input prompt. This simple approach leads to impressive results, yet exhibits a tradeoff between preserving the guidance layout and fulfilling the target text. We demonstrate that our method significantly outperforms SDEdit, providing a better balance between these two ends.

3. Preliminary

Diffusion models [11, 18, 36, 43] are probabilistic generative models in which an image is generated by progressively removing noise from an initial Gaussian noise image, $x_T \sim \mathcal{N}(0, I)$. These models are founded on two complementary random processes: the forward process, in which Gaussian noise is progressively added to a clean image $x_0$:

$x_t = \sqrt{\alpha_t} \cdot x_0 + \sqrt{1 - \alpha_t} \cdot z$   (1)

where $z \sim \mathcal{N}(0, I)$ and $\{\alpha_t\}$ is the noise schedule.
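For illustration, Eq. (1) amounts to a single line of tensor arithmetic; the sketch below uses a hypothetical noise schedule purely to make the snippet self-contained, and is not the schedule of any particular trained model.

```python
import torch

# Minimal sketch of the forward process in Eq. (1). The linearly spaced
# schedule below is a toy assumption; trained models fix their own {alpha_t}.
T = 1000
alphas = torch.linspace(0.9999, 0.0001, T)  # alpha_T ~ 0, so x_T is nearly pure noise

def add_noise(x0: torch.Tensor, t: int, alphas: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * z, with z ~ N(0, I)."""
    z = torch.randn_like(x0)
    a_t = alphas[t]
    return a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * z

x0 = torch.randn(1, 4, 64, 64)   # e.g., a latent-space image of Stable Diffusion size
x_t = add_noise(x0, t=540, alphas=alphas)
```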
given a few user-provided images. Our method also lever-
The backward process is aimed at gradually denoising
ages a pre-trained text-to-image diffusion model to achieve
xT , where at each step a cleaner image is obtained. This
our goal, yet does not involve any training or fine-tuning.
process is achieved by a neural network θ (xt , t) that pre-
Instead, we devise a simple framework that intervenes in
dicts the added noise z. Once trained, each step of the
the generation process by directly manipulating the spatial
backward process consists of applying θ to the current xt ,
features.
and adding a Gaussian noise perturbation to obtain a cleaner
As discussed in Sec. 1, our methodological approach is xt−1 .
related to Prompt-to-Prompt [16], yet our method offers Diffusion models are rapidly evolving and have been ex-
several key advantages: (i) enables fine-grained control over tended and trained to progressively generate images condi-
the generated shape and layout, (ii) allows to use arbitrary tioned on a guiding signal θ (xt , y, t), e.g., conditioning
text-prompts to express the target translation; in contrast to the generation on another image [39], class label [19], or
P2P that requires word-to-word alignment between a source text [22, 28, 34, 36].
and target text prompts, (iii) demonstrates superior perfor- In this work, we leverage a pre-trained text-conditioned
mance of real-world guidance images. Latent Diffusion Model (LDM), a.k.a Stable Diffusion [36],
Lastly, SDEdit [27] is another method that applies ed- in which the diffusion process is applied in the latent space
its on user provided images using free text prompts. Their of a pre-trained image autoencoder. The model is based on a
method noises the guidance image to an intermediate dif- U-Net architecture [37] conditioned on the guiding prompt
fusion step, and then denoises it conditioned on the input P . Layers of the U-Net comprise a residual block, a self-
prompt. This simple approach leads to impressive results, attention block, and a cross-attention block, as illustrated in
yet exhibit a tradeoff between preserving the guidance lay- Fig. 2 (b). The residual block convolve image features φtl−1
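Although the paper does not prescribe an implementation, collecting such intermediate decoder features and self-attention outputs can be done with standard PyTorch forward hooks. The sketch below assumes a diffusers-style Stable Diffusion U-Net; the module-name filters (e.g., `up_blocks`, `attn1`) are assumptions and may differ in other codebases.

```python
import torch

# Sketch: cache intermediate decoder activations during the guidance pass so they
# can be injected later. Module names are assumptions about a diffusers-style U-Net.
feature_cache = {}

def make_hook(name):
    def hook(module, inputs, output):
        # store a detached copy of the layer output (residual-block features or
        # self-attention output), keyed by the module name
        feature_cache[name] = output.detach()
    return hook

def register_feature_hooks(unet: torch.nn.Module):
    handles = []
    for name, module in unet.named_modules():
        # decoder residual blocks and self-attention modules; adjust these name
        # filters to the architecture you are actually using
        if "up_blocks" in name and (name.endswith("resnets.1") or name.endswith("attn1")):
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle when done
```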
[Figure 3 grid: input images (generated and real) and PCA feature visualizations at decoder layers 1, 4, 7, and 11.]
Figure 3. Visualising diffusion features. We used a collection of 20 humanoid images (real and generated), and extracted spatial features
from different decoder layers, at roughly 50% of the generation process (t = 540). For each block, we applied PCA on the extracted features
across all images and visualized the top three leading components. Intermediate features (layer 4) reveal semantic regions (e.g., legs or
torso) that are shared across all images, under large variations in object appearance and the domain of images. Deeper features capture
more high-frequency information, which eventually forms the output noise predicted by the model. See SM for additional visualizations.
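The PCA visualization described in the caption can be reproduced roughly as follows. This is a sketch that assumes the per-image feature maps at a chosen decoder layer have already been extracted; it fits one PCA jointly over all images and maps the top three components to RGB.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(feature_maps):
    """feature_maps: list of arrays shaped (C, H, W), one per image.
    Returns a list of (H, W, 3) visualizations of the top-3 PCA components."""
    C = feature_maps[0].shape[0]
    # stack all spatial locations from all images into one (N, C) matrix
    X = np.concatenate([f.reshape(C, -1).T for f in feature_maps], axis=0)
    pca = PCA(n_components=3).fit(X)
    vis = []
    for f in feature_maps:
        comp = pca.transform(f.reshape(C, -1).T)                       # (H*W, 3)
        comp = (comp - comp.min(0)) / (comp.max(0) - comp.min(0) + 1e-8)  # normalize to [0, 1]
        vis.append(comp.reshape(f.shape[1], f.shape[2], 3))
    return vis
```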
4. Method

Given an input guidance image $I^G$ and a target prompt $P$, our goal is to generate a new image $I^*$ that complies with $P$ and preserves the structure and semantic layout of $I^G$. We consider Stable Diffusion [36], a state-of-the-art pre-trained and fixed text-to-image LDM, denoted by $\epsilon_\theta(x_t, P, t)$. This model is based on a U-Net architecture, as illustrated in Fig. 2 and discussed in Sec. 3.

Our key finding is that fine-grained control over the generated structure can be achieved by manipulating spatial features inside the model during the generation process. Specifically, we observe and empirically demonstrate that: (i) spatial features extracted from intermediate decoder layers encode localized semantic information and are less affected by appearance information, and (ii) the self-attention, which represents the affinities between the spatial features, allows retaining fine layout and shape details.

Based on our findings, we devise a simple framework that extracts features from the generation process of the guidance image $I^G$ and directly injects them, along with $P$, into the generation process of $I^*$, requiring no training or fine-tuning (Fig. 2). Our approach is applicable to both text-generated and real-world guidance images, for which we apply DDIM inversion [44] to get the initial $x_T^G$.
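Deterministic DDIM inversion can be sketched as running the DDIM update forward in time. The helper below assumes a noise predictor eps_model(x, t) (e.g., the U-Net queried with an empty or source prompt) and the scheduler's cumulative alpha_bar values; it is an illustration, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Deterministic DDIM inversion sketch: map a clean latent x0 to x_T by
    running the DDIM update forward in time with eta = 0.
    alphas_cumprod is a tensor of cumulative alpha_bar values; timesteps is increasing."""
    x = x0
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur)
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()   # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps   # re-noise to t_next
    return x  # approximately x_T^G
```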
Spatial features. In text-to-image generation, one can use descriptive text prompts to specify various scene and object properties, including those related to their shape, pose, and […]
[Figure panels: source image with features injected at layer 4, layers 4–8, and layers 4–11; input image with visualizations at layers 4, 8, and 11.]
[…] leaked into the generated image (e.g., shades of the red t-shirt and blue jeans are apparent in Layers 4–11). To achieve a better balance between preserving the structure of $I^G$ and deviating from its appearance, we do not modify spatial features at deep layers, but rather leverage the self-attention layers, as discussed below.

Self-attention. Self-attention modules compute the affinities $A_t^l$ between the spatial features after linearly projecting them into queries and keys. These affinities have a tight connection to the established concept of self-similarity, which has been used to design structure descriptors by both classical and modern works [3, 24, 41, 46]. This motivates us to consider the attention matrices $A_t^l$ for achieving fine-grained control over the generated content.

Fig. 6 shows the leading principal components of a matrix $A_t^l$ for a given image. As seen, in early layers the attention is aligned with the semantic layout of the image, grouping regions according to semantic parts. Gradually, higher-frequency information is captured.

Practically, injecting the self-attention matrix is done by replacing the matrix $A_t^l$ in Eq. (2). Intuitively, this operation pulls features closer together according to the affinities encoded in $A_t^l$. We denote this additional operation by modifying Eq. (3) as follows:

$z_{t-1}^* = \hat{\epsilon}_\theta(x_t, P, t;\; f_t^4, \{A_t^l\})$   (4)

Fig. 5(b) shows the effect of Eq. (4) for increasing injection layers; the maximal injection layer of $A_t^l$ controls the level of fidelity to the original structure, while mitigating the issue of appearance leakage. Fig. 5(c) demonstrates the pivotal role of the features $f_t^4$. As seen, with only self-attention, i.e., $z_{t-1}^* = \hat{\epsilon}_\theta(x_t, P, t;\; \{A_t^l\})$, there is no semantic association between the original content and the translated one, resulting in large deviations in structure.
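Concretely, the override can be sketched as a drop-in replacement inside a self-attention block. The function below is a minimal PyTorch illustration; head handling and the exact hook point in Stable Diffusion's attention code are assumptions, not the paper's released implementation.

```python
import torch

def self_attention_with_injection(q, k, v, injected_attn=None):
    """Standard self-attention, optionally overriding the attention matrix A_t^l
    with one saved from the guidance pass (cf. Eq. (4)). Shapes: (B, heads, N, d)."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # A_t^l of the current pass
    if injected_attn is not None:
        attn = injected_attn        # plug in the guidance affinities instead
    return attn @ v                 # features pulled together according to A_t^l
```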
Our plug-and-play diffusion features framework is summarized in Alg. 1 and is controlled by two parameters: (i) $\tau_f$ defines the sampling step $t$ until which $f_t^4$ is injected; (ii) $\tau_A$ is the sampling step until which $A_t^l$ are injected. In all our results, we use a default setting where self-attention is injected into all the decoder layers. The exact settings of the parameters are discussed in Sec. 5.

Algorithm 1: Plug-and-Play Diffusion Features
Inputs: $I^G$ (real guidance image), $P$ (target text prompt), $\tau_f, \tau_A$ (injection thresholds)
  $x_T^G \leftarrow \text{DDIM-inv}(I^G)$
  $x_T^* \leftarrow x_T^G$        ▷ starting from the same seed
  for $t = T, \dots, 1$ do
    $z_{t-1}^G, \{f_t^4, A_t^l\} \leftarrow \epsilon_\theta(x_t^G, \varnothing, t)$
    $x_{t-1}^G \leftarrow \text{DDIM-samp}(x_t^G, z_{t-1}^G)$
    if $t > \tau_f$ then $f_t^{*4} \leftarrow f_t^4$ else $f_t^{*4} \leftarrow \varnothing$
    if $t > \tau_A$ then $\{A_t^{*l}\} \leftarrow \{A_t^l\}$ else $\{A_t^{*l}\} \leftarrow \varnothing$
    $z_{t-1}^* \leftarrow \hat{\epsilon}_\theta(x_t^*, P, t;\; f_t^{*4}, \{A_t^{*l}\})$
    $x_{t-1}^* \leftarrow \text{DDIM-samp}(x_t^*, z_{t-1}^*)$
  end for
Output: $I^* \leftarrow x_0^*$
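A Python rendering of Alg. 1 could look like the sketch below. Here extract_features, denoise_with_injection, and ddim_step are placeholders for the guidance pass with feature hooks, the injected U-Net call $\hat{\epsilon}_\theta$, and the deterministic DDIM update, respectively; none of them is an existing library function.

```python
import torch

@torch.no_grad()
def plug_and_play(x_T_G, prompt, timesteps, tau_f, tau_A,
                  extract_features, denoise_with_injection, ddim_step):
    """Sketch of Alg. 1: the guidance and translation processes share the same
    initial noise and are sampled side by side."""
    x_G, x_star = x_T_G, x_T_G.clone()           # x_T^* = x_T^G (same seed)
    for t in timesteps:                           # t = T, ..., 1
        # guidance pass: predict noise with the empty prompt and cache (f_t^4, {A_t^l})
        z_G, f4, attn = extract_features(x_G, t)
        x_G = ddim_step(x_G, z_G, t)
        # decide what to inject at this step
        f4_inj   = f4   if t > tau_f else None
        attn_inj = attn if t > tau_A else None
        # translation pass: epsilon-hat with injected features and the target prompt
        z_star = denoise_with_injection(x_star, prompt, t, f4_inj, attn_inj)
        x_star = ddim_step(x_star, z_star, t)
    return x_star   # decode with the LDM autoencoder to obtain I^*
```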
Negative-prompting. In classifier-free guidance [20], the predicted noise at each sampling step is given by:

$\epsilon = w \cdot \epsilon_\theta(x_t, P, t) + (1 - w) \cdot \epsilon_\theta(x_t, \varnothing, t)$   (5)

where $w > 1$ is the guidance strength. That is, $\epsilon$ is extrapolated towards the conditional prediction $\epsilon_\theta(x_t, P, t)$ and pushed away from the unconditional one $\epsilon_\theta(x_t, \varnothing, t)$. This increases the fidelity of the denoised image to the prompt $P$, while allowing it to deviate from $\epsilon_\theta(x_t, \varnothing, t)$. Similarly, by replacing the empty prompt in Eq. (5) with a "negative" prompt $P_n$, we can push the prediction away from $\epsilon_\theta(x_t, P_n, t)$. For example, using a $P_n$ that describes the guidance image, we can steer the denoised image away from the original content.

We use a parameter $\alpha \in [0, 1]$ to balance between neutral and negative prompting:

$\tilde{\epsilon} = \alpha \cdot \epsilon_\theta(x_t, \varnothing, t) + (1 - \alpha) \cdot \epsilon_\theta(x_t, P_n, t)$   (6)

We plug $\tilde{\epsilon}$ in place of $\epsilon_\theta(x_t, \varnothing, t)$ in Eq. (5), i.e., $\epsilon = w \cdot \epsilon_\theta(x_t, P, t) + (1 - w) \cdot \tilde{\epsilon}$.

In practice, we find negative prompting to be beneficial for handling textureless "primitive" guidance images (e.g., silhouette images). For natural-looking guidance images, it plays a minor role. See Appendix A.1 for more details.
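Putting Eqs. (5) and (6) together costs one extra noise prediction per step. The sketch below treats eps as a stand-in for $\epsilon_\theta$ and only illustrates the arithmetic.

```python
def guided_noise(eps, x_t, t, prompt, neg_prompt, w=7.5, alpha=1.0):
    """Classifier-free guidance with optional negative prompting (Eqs. (5)-(6)).
    alpha = 1 recovers plain CFG; alpha < 1 mixes in the negative prompt P_n."""
    e_cond   = eps(x_t, prompt, t)       # epsilon_theta(x_t, P, t)
    e_uncond = eps(x_t, "", t)           # epsilon_theta(x_t, empty prompt, t)
    e_neg    = eps(x_t, neg_prompt, t)   # epsilon_theta(x_t, P_n, t)
    e_tilde  = alpha * e_uncond + (1.0 - alpha) * e_neg   # Eq. (6)
    return w * e_cond + (1.0 - w) * e_tilde               # Eq. (5) with e_tilde plugged in
```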
5. Results

We thoroughly evaluate our method both quantitatively and qualitatively on diverse guidance image domains, both real and generated, as discussed below. Please see Appendix C for full implementation details of our method.

Datasets. Our method supports versatile text-guided image-to-image translation tasks and can be applied to arbitrary image domains. Since there is no existing benchmark for such diverse settings, we created two new datasets: (i) Wild-TI2I, comprising 148 diverse text-image pairs, 53% of which consist of real guidance images that we gathered from the Web; (ii) ImageNet-R-TI2I, a benchmark we derived from the ImageNet-R dataset [15], which comprises various renditions (e.g., paintings, embroidery, etc.) of ImageNet object classes. To adapt this dataset for our purpose, we manually selected 3 high-quality images from 10 different classes. To generate our image-text examples, we created a list of text templates by defining for each source
[Figure panels (Wild TI2I-Real and ImageNet-R-TI2I): guidance images with prompts "a photo of a golden robot", "a photo of a wooden statue", "a photo of a sand sculpture", "a photo of a golden sculpture in a temple", "a photo of a silver robot", "a photo of a sparrow", "a photo of a hummingbird in a forest", "a photo of an eagle", "a sculpture of a hummingbird", "an image of a hummingbird", "a sketch of a parrot".]

[…] benchmark, for which we automatically created aligned source-target prompts using the labels provided for the renditions and object categories. We further include a qualitative comparison to a subset of Wild-TI2I for which the source and target prompts are aligned. For evaluating P2P on real guidance images, we applied DDIM inversion with the source text, as in [16].

Fig. 8 shows sample results of our method compared with the baselines. As seen, our method successfully translates diverse inputs, and works well for both real and generated guidance images. In all cases, our results exhibit both high preservation of the guidance layout and high fidelity to the target prompt. This is in contrast to SDEdit, which suffers from an inherent tradeoff between the two – with a low noise level, the guidance structure is well preserved but at the expense of hardly changing the appearance; larger deviations in appearance can be achieved with a higher noise level, yet the structure is damaged. VQGAN+CLIP exhibits the same behavior, with overall lower image quality. Similarly, DiffuseIT shows high fidelity to the guiding shape, with little change to the appearance.
[Figure 8 grid – columns: Source image, Ours, P2P, DiffuseIT, SDEdit .6, SDEdit .75, SDEdit .85, VQ+CLIP; rows grouped as Wild TI2I-Generated, Wild TI2I-Real, and ImageNet-R-TI2I, with prompts "a photo of rubber ducks walking on street", "a photo of a golden robot", "a photo of a car made of ice on the grass", "a photo of a white wedding cake", "a photo of a silver robot walking on the moon", "a photo of a pizza", "an origami of a panda", "a tattoo of a hummingbird".]
Figure 8. Comparisons. Sample results are shown for each of the two benchmarks: ImageNet-R-TI2I and Wild-TI2I, which includes both real and generated guidance images. From left to right: the guidance image and text prompt, our results, P2P [16], DiffuseIT [25], SDEdit [27] with 3 different noising levels, and VQ+CLIP [9].
In comparison to P2P, it can be seen that their results on generated guidance images (first 3 rows) depict high fidelity to the target text. However, P2P often results in large deviations from the guidance structure, especially in cases where multiple prompt edits are applied (last two rows in Fig. 10). Our method demonstrates fine-grained structure preservation across all these examples, while successfully translating multiple traits (e.g., both category and style). These properties are strongly evident in Fig. 9, where our method results in a significantly lower self-similarity distance, even compared to P2P with cross-attention injected at all timesteps (t = 1000).

Additional baselines. Fig. 11 shows qualitative comparisons with: (i) Text2LIVE [4], (ii) DiffusionCLIP [22], and (iii) FlexIT [8]. These methods either fail to deviate from the guidance image or result in noticeable visual artifacts. Since Text2LIVE is designed for layered textural editing, it can only "paint" over the guidance image and cannot apply the structural changes necessary to convey the target edit (e.g., dog to Venom, church to Asian tower). Moreover, Text2LIVE does not leverage a strong generative prior, hence it often results in low visual quality. FlexIT often fails […]
[Figure 9 scatter plots, panels (a) Wild-TI2I, (b) ImageNet-R-TI2I, (c) Generated ImageNet-R-TI2I: structure distance vs. text-image similarity (CLIP cosine similarity) for Ours, Ours (full), P2P (t=500), SDEdit (n=0.6, n=0.85), VQGAN-CLIP, and DiffuseIT, with arrows marking the "Better" direction.]
Figure 9. Quantitative evaluation. We measure CLIP cosine similarity (higher is better) and DINO-ViT self-similarity distance (lower is better) to quantify the fidelity to the text and the preservation of structure, respectively. We report these metrics on three benchmarks: (a) Wild-TI2I, for which an ablation of our method is included, (b) ImageNet-R-TI2I, and (c) Generated-ImageNet-R-TI2I. Note that we could compare to P2P only on (b) and (c) due to its prompt restrictions. All baselines struggle to achieve both a low structure distance and a high CLIP score. Our method exhibits a better balance between these two ends across all benchmarks.
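For reference, the CLIP cosine-similarity axis can be computed along the following lines. The sketch uses the Hugging Face transformers CLIP implementation with the openai/clip-vit-base-patch32 checkpoint as one possible choice; the paper does not specify which CLIP variant it uses.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(image_path: str, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher is better)."""
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return torch.cosine_similarity(img_emb, txt_emb).item()
```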
[Figure 11 grid – columns: Guidance, Text2LIVE, DiffusionCLIP, FlexIT, Ours; row prompts: "a photo of Venom", "a photo of a bear", "a photo of a yorkshire terrier", "a photo of an ancient Asian tower", "a photo of a wooden house", "a photo of a golden church".]
Figure 11. Qualitative comparisons to additional baselines: Text2LIVE [4], DiffusionCLIP [22], FlexIT [8]. These methods fail to deviate from the structure to match the target prompt, or create undesired artifacts.

[Figure 12 panels: two guidance image / our result pairs.]
Figure 12. Limitations. Our method fails when there is no semantic association between the guidance content and the target text. Thus, it does not perform well on solid segmentation masks with arbitrary colors.

[…] which case some appearance information would leak into our results. We believe that our work demonstrates the yet unrealized potential of the rich and powerful feature space spanned by pre-trained text-to-image diffusion models. We hope it will motivate future research in this direction.

Acknowledgments: We thank Omer Bar-Tal for his insightful comments and discussion. This project received funding from the Israeli Science Foundation (grant 2303/20), the Carolito Stiftung, and the NVIDIA Applied Research Accelerator Program. Dr. Bagon is a Robin Chemers Neustein AI Fellow.

References

[1] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle: A residual-based stylegan encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 2
[2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 2
[3] Shai Bagon, Ori Brostovski, Meirav Galun, and Michal Irani. Detecting and sketching the common. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010. 6
[4] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision. Springer, 2022. 2, 7, 8, 10
[5] Dmitry Baranchuk, Andrey Voynov, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. In International Conference on Learning Representations, 2022. 5
[6] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021. 2
[7] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. Sketch2photo: internet image montage. ACM Trans. Graph., 2009. 2
[8] Guillaume Couairon, Asya Grechka, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. FlexIT: Towards flexible semantic image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 7, 8, 10
[9] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Conference on Computer Vision, pages 88–105. Springer, 2022. 7, 8
[10] Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, and William T Freeman. Sparse, smart contours to represent and edit images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 2
[11] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 2021. 3
[12] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), 2022. 1, 2, 3
[13] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022. 3
[14] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 2022. 2
[15] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021. 6
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022. 2, 3, 7, 8
[17] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. Image analogies. In Lynn Pocock, editor, ACM Trans. on Graphics (Proceedings of ACM SIGGRAPH), 2001. 2
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020. 3
[19] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47–1, 2022. 3
[20] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 6
[21] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv preprint arXiv:2210.09276, 2022. 3
[22] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. 2, 3, 7, 8, 10
[23] Kunhee Kim, Sanghun Park, Eunyeong Jeon, Taehun Kim, and Daijin Kim. A style-aware discriminator for controllable image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2
[24] Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and self-similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. 6
[25] Gihyun Kwon and Jong Chul Ye. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264, 2022. 2, 7, 8
[26] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. Fusedream: Training-free text-to-image generation with improved clip+gan space optimization. arXiv preprint arXiv:2112.01573, 2021. 2
[27] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021. 3, 7, 8
[28] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021. 2, 3
[29] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, 2020. 2
[30] Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman, Alexei Efros, and Richard Zhang. Swapping autoencoder for deep image manipulation. Advances in Neural Information Processing Systems, 2020. 2
[31] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. 2
[32] Lara Raad and Bruno Galerne. Efros and freeman image quilting algorithm for texture synthesis. Image Process. Line, 2017. 2
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2021. 2
[34] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. 1, 2, 3
[35] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021. 2
[36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 1, 2, 3, 4
[37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 2015. 3
[38] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022. 3
[39] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022. 2, 3
[40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022. 1, 3
[41] Eli Shechtman and Michal Irani. Matching local self-similarities across images and videos. In IEEE Conference on Computer Vision and Pattern Recognition, 2007. 6
[42] Yichang Shih, Sylvain Paris, Frédo Durand, and William T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Trans. Graph., 2013. 2
[43] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2015. 3
[44] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
ing diffusion implicit models. In International Conference
on Learning Representations, 2020. 4, 5
[45] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and
Daniel Cohen-Or. Designing an encoder for stylegan im-
age manipulation. ACM Transactions on Graphics (TOG),
40(4):1–14, 2021. 2
[46] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali
Dekel. Splicing vit features for semantic appearance transfer.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2022. 2, 6, 7
[47] Yael Vinker, Eliahu Horwitz, Nir Zabari, and Yedid Hoshen.
Image shape manipulation from a single augmented training
sample. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 13769–13778, 2021. 2
[48] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong
Chen, Qifeng Chen, and Fang Wen. Pretraining is all you
need for image-to-image translation. In arXiv, 2022. 2, 3
[49] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In Proceedings of the IEEE
international conference on computer vision, 2017. 2
A. Ablations

A.1. Negative-prompting.

We qualitatively and quantitatively ablate the effect of negative prompting (see Sec. 4 of the main paper). Tab. 1 compares our metrics with and without negative prompting (bottom row and second row, respectively), using our Wild-TI2I and ImageNet-R-TI2I benchmarks. The results indicate that the use of negative prompting (bottom row) leads to a slightly larger deviation from the guidance image (higher LPIPS distance between $I^G$ and $I^*$), while introducing only a minor reduction in structure preservation. Sample results of this ablation are shown in Fig. 13, where we can also notice that negative prompting has a larger effect on "primitive" images, i.e., simple textureless images such as silhouettes (top two rows), than on natural guidance images.

[Figure 13 panels: guidance images with prompts "a photo of a hummingbird", "a photorealistic image of a wooden sculpture", and "an image of Kung Fu Panda", each shown without and with negative prompting.]
Figure 13. Qualitative ablation of negative prompting. The effect of negative prompting is most meaningful on textureless guidance images. In the case of realistic images (row 3) it has a minor effect.

A.2. Injected features.

Our method injects features into the decoder block, at a specific layer which we observed to capture localized semantic information. To complete our analysis, we extend our PCA feature visualization to include both the decoder and encoder features. As seen in Fig. 15, the encoder exhibits a mirrored trend to the decoder: the encoder features start with high-frequency noise (layer 1), which is gradually transformed into cleaner features that depict lower-frequency content throughout the layers. Nevertheless, localized semantic information is overall less apparent in the encoder's features. To numerically evaluate this, we consider a modified version of our method where the encoder features from layer 7, which exhibit some semantic information, are additionally injected. As seen in Tab. 1, this combination results in a worse CLIP score on all datasets and a smaller LPIPS deviation from the guidance image on most sets (first row).
B. Initial noise $x_T$ and spatial features

We observed that in order for our method to work, the initial noise used to generate the translated image, $x_T^*$, has to match the initial guidance noise $x_T^G$. Since we inject features into the decoder from the very first step of the backward process, this dependency on the random seed can only be explained by the encoder features at $t = T$, denoted by $f_T^{e_l}$. Recall that these features depend on both $x_T^*$ and the target prompt $P$. This raises the question: why do the $f_T^{e_l}$ originating from $x_T^* = x_T^G$ and an arbitrary text prompt $P$ allow our method to work? We hypothesize that at $t = T$ the target prompt has little effect on the encoder features $f_T^{e_l}$, and thus the injected decoder features $f_T^4$ comply with the encoder features. In contrast, changing the seed results in a mismatch between $f_T^4$ and $f_T^{e_l}$. This may be surprising, since images generated from the same seed under different text prompts may dramatically differ from one another (see Fig. 14, bottom).

[Figure 14: "Same prompt, different seeds" vs. "Different prompt, same seed"; input, encoder features, and bottleneck.]

To validate this hypothesis, we performed an analysis showing that features formed from the same $x_T$ under arbitrary prompts $\{P_i\}$ are significantly more correlated than those generated under the same prompt $P$ with different seeds $\{x_T^i\}$. Specifically, we used 10 different prompts and 10 different seeds to generate 100 images, using all combinations. We considered the images generated under: (i) the same prompt across different seeds, and (ii) the same seed across different prompts – a total of 20 sets of 10 images each. We then measured the variance between the feature maps within each of these sets. In Fig. 14 (top), we report these variances (averaged across spatial locations) as a function of the encoder layer $l$. As seen, changing the initial seed, for any fixed prompt, results in significantly higher variance across features for all layers $l$ compared to fixing the seed and changing only the prompt. These findings validate our hypothesis and support our method's dependency on the initial seed.
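The variance measurement described above can be sketched as follows, assuming the encoder feature maps at a given layer have already been grouped into the 20 sets (10 images each) mentioned in the text.

```python
import torch

def mean_feature_variance(feature_sets):
    """feature_sets: list of sets; each set is a list of feature maps (C, H, W)
    extracted at the same encoder layer from 10 images.
    Returns the variance within a set, averaged over channels and spatial
    locations, and then averaged over all sets."""
    per_set = []
    for feats in feature_sets:
        stack = torch.stack(feats)               # (N, C, H, W)
        per_set.append(stack.var(dim=0).mean())  # variance across the set, then mean
    return torch.stack(per_set).mean().item()
```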
C. Implementation Details

We use Stable Diffusion as our pre-trained text-to-image model; specifically, the StableDiffusion-v-1-4 checkpoint provided via the official HuggingFace webpage.

In all of our experiments, we use deterministic DDIM sampling with 50 steps. In the case of real guidance images, we perform deterministic DDIM inversion with 1000 forward steps and then perform deterministic DDIM sampling with 1000 backward steps. Our translation results are produced with 50 sampling steps, thus we extract features only at these steps. We set our default injection thresholds to $\tau_A = 25$ and $\tau_f = 40$ out of the 50 sampling steps; for primitive guidance images, we found that $\tau_A = \tau_f = 25$ works better.

During translation, we set the classifier-free guidance scale for real and generated guidance images to 15.0 and 7.5, respectively.
                    | Wild-TI2I Real           | Wild-TI2I Generated      | ImageNet-R-TI2I
                    | Self-Sim↓  CLIP↑  LPIPS↑ | Self-Sim↓  CLIP↑  LPIPS↑ | Self-Sim↓  CLIP↑  LPIPS↑
w/ encoder-feat-7   |   0.058    0.280   0.527 |   0.035    0.264   0.453 |   0.050    0.273   0.458
w/o neg. prompt     |   0.052    0.281   0.490 |   0.033    0.275   0.441 |   0.048    0.274   0.451
w/o feat.           |   0.090    0.288   0.584 |   0.084    0.297   0.633 |   0.076    0.281   0.534
w/o self-attn.      |   0.097    0.286   0.597 |   0.090    0.295   0.657 |   0.089    0.278   0.564
Our method          |   0.058    0.282   0.521 |   0.048    0.289   0.542 |   0.051    0.275   0.462
Table 1. Quantitative ablation evaluation. We evaluate DINO-ViT self-similarity distance for structure preservation, CLIP score for target-text faithfulness, and LPIPS distance for deviation from the guidance image. We ablate the feature injection, the self-attention injection, negative prompting, and additional feature injection in the encoder blocks. We report these scores on our three text-guided I2I benchmarks. The configuration reported in the main paper (feature and self-attention injection with negative prompting) provides the best balance between the three metrics across the datasets.
The use of negative prompting is con- […] realistic generated images, and for generated images that are primitive/textureless, we use an exponential scheduler $\alpha(t) = e^{-6 \cdot t}$.

For running the competitors, we use their official implementations: Prompt-to-Prompt, DiffuseIT, DiffusionCLIP, Text2LIVE, FlexIT. For running SDEdit on Stable Diffusion, we use the implementation available in the official Stable Diffusion repo. For running VQGAN-CLIP, we used the publicly available repo.
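For quick reference, the defaults listed in this appendix can be gathered into a single configuration. The dictionary and helper below are only an illustrative convention; the key names are ours, not part of any released configuration file, and the schedule assumes a normalized timestep.

```python
import math

# Illustrative summary of the defaults reported in Appendix C.
PNP_DEFAULTS = {
    "checkpoint": "StableDiffusion-v-1-4",
    "ddim_sampling_steps": 50,       # deterministic DDIM sampling
    "ddim_inversion_steps": 1000,    # for real guidance images
    "tau_A": 25,                     # self-attention injection threshold (out of 50 steps)
    "tau_f": 40,                     # feature injection threshold (out of 50 steps)
    "tau_A_primitive": 25,           # thresholds used for primitive/textureless guidance
    "tau_f_primitive": 25,
    "cfg_scale_real": 15.0,          # classifier-free guidance scale, real guidance images
    "cfg_scale_generated": 7.5,      # classifier-free guidance scale, generated guidance images
}

def negative_prompt_alpha(t_norm: float) -> float:
    """Exponential schedule alpha(t) = exp(-6 t) used for primitive/textureless
    generated guidance images; t_norm is assumed to be a normalized timestep in [0, 1]."""
    return math.exp(-6.0 * t_norm)
```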
[Figure 15 grid: input images (generated and real) and PCA feature visualizations at encoder and decoder layers 1, 4, 7, and 11.]
Figure 15. Visualizing diffusion features for both encoder and decoder. Extending the visualization of Fig. 3 in the main paper to include
features from encoder blocks of the U-Net at time t = 540 (top part).