0% found this document useful (0 votes)

26 views15 pages

Plug and Play Diffusion Feature

This paper presents a novel framework for text-driven image-to-image translation that utilizes a pre-trained text-to-image diffusion model to generate new images based on a guidance image and a target text prompt. The method allows for fine-grained control over the generated structure and semantics without requiring any training or fine-tuning, thus enabling high-quality translations across various tasks. The authors demonstrate that their approach outperforms existing methods by preserving the layout of the guidance image while altering its appearance according to the text prompt.

Uploaded by

cxy010728

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views15 pages

Plug and Play Diffusion Feature

Uploaded by

cxy010728

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

Narek Tumanyan∗ Michal Geyer∗ Shai Bagon Tali Dekel

Weizmann Institute of Science
*Indicates equal contribution.
Project webpage: https://pnp-diffusion.github.io
arXiv:2211.12572v1 [cs.CV] 22 Nov 2022

“a photo of a bronze “A photo of a pink “A photo of a Input Real Image “A polygonal illustration “A photo of bear
Input Real Image of a cat and a bunny” cubs in the snow”
horse in a museum” horse on the beach” robot horse”

“A wooden sculpture “A cartoon of “a photo of “A polygonal illustartion “A photo of sharks

Input Real Image of a couple dancing” a couple dancing” robots dancing”
Input Generated Image of fish in the ocean” in the ocean”

Figure 1. Given a single real-world image as input, our framework enables versatile text-guided translations of the original content. Our
results exhibit high fidelity to the input structure and scene layout, while significantly changing the perceived semantic meaning of objects
and their appearance. Our method does not require any training, but rather harnesses the power of a pre-trained text-to-image diffusion
model through its internal representation. We present new insights about deep features encoded in such models, and an effective framework
to control the generation process through simple modification of these features.

Abstract and appearance of objects in a given image, and modifica-

tions of global qualities such as lighting and color.
Large-scale text-to-image generative models have been
a revolutionary breakthrough in the evolution of generative
AI, allowing us to synthesize diverse images that convey 1. Introduction
highly complex visual concepts. However, a pivotal chal-
lenge in leveraging such models for real-world content cre- With the rise of text-to-image foundation models –
ation tasks is providing users with control over the gener- billion-parameter models trained on a massive amount of
ated content. In this paper, we present a new framework text-image data, it seems that we can translate our imagina-
that takes text-to-image synthesis to the realm of image-to- tion into high-quality images through text [12, 34, 36, 40].
image translation – given a guidance image and a target While such foundation models unlock a new world of cre-
text prompt as input, our method harnesses the power of a ative processes in content creation, their power and expres-
pre-trained text-to-image diffusion model to generate a new sivity come at the expense of user controllability, which is
image that complies with the target text, while preserving largely restricted to guiding the generation solely through
the semantic layout of the guidance image. Specifically, we an input text. In this paper, we focus on attaining con-
observe and empirically demonstrate that fine-grained control over the generated structure and semantic layout of the
trol over the generated structure can be achieved by manip- scene – an imperative component in various real-world con-
ulating spatial features and their self-attention inside the tent creation tasks, ranging from visual branding and mar-
model. This results in a simple and effective approach, keting to digital art. That is, our goal is to take text-to-image
where features extracted from the guidance image are di- generation to the realm of text-guided Image-to-Image (I2I)
rectly injected into the generation process of the translated translation, where an input image guides the layout (e.g., the
image, requiring no training or fine-tuning. We demonstrate structure of the horse in Fig. 1), and the text guides the per-
high-quality results on versatile text-guided image transla- ceived semantics and appearance of the scene (e.g., “robot
tion tasks, including translating sketches, rough drawings horse” in Fig. 1).
and animations into realistic images, changing of the class A possible approach for achieving control of the gen-

1
erated layout is to design text-to-image foundation models our method outperforms existing state-of-the-art baselines,
that explicitly incorporate additional guiding signals, such achieving significantly better balance between preserving
as user-provided masks [12, 28, 34]. For example, recently the guidance layout and deviating from its appearance.
Make-A-Scene [12] trained a text-to-image model that is
also conditioned on a label segmentation mask, defining the 2. Related Work
layout and the categories of objects in the scene. However,
such an approach requires an extensive compute as well as Image-to-image translation. Image-to-Image (I2I)
large-scale text-guidance-image training tuples, and can be translation is aimed at estimating a mapping of an image
applied at test-time to these specific types of inputs. In this from a source domain to a target domain, while preserving
paper, we are interested in a unified framework that can be the domain-invariant characteristics of the input image,
applied to versatile I2I translation tasks, where the struc- e.g., objects’ structure or scene layout. From classical
ture guidance signal ranges from artistic drawings to photo- to modern data-driven methods, numerous visual prob-
realistic images (see Fig. 1). Our method does not require lems have been formulated and tackled as an I2I task
any training or fine-tuning, but rather leverages a pre-trained (e.g., [7, 10, 17, 32, 42]). Seminal deep-learning-based
and fixed text-to-image diffusion model [36]. methods have proposed various GAN-based frameworks to
Specifically, we pose the fundamental question of how encourage the output image to comply with the distribution
structure information is internally encoded in such a model. of the target domain [23, 29, 30, 49]. Nevertheless, these
We dive into the intermediate spatial features that are methods require datasets of example images from both
formed during the generation process, empirically analyze source and target domains, and often require training
them, and devise a new framework that enables fine-grained from scratch for each translation task (e.g., horse-to-zebra,
control over the generated structure by applying simple ma- day-to-night, summer-to-winter). Other works utilize
nipulations to spatial features inside the model. Specifically, pre-trained GANs by performing the translation in its latent
spatial features and their self-attentions are extracted from space [1, 35, 45]. Several methods have also considered
the guidance image, and are directly injected into the text- the task of zero-shot I2I by training a generator on a single
guided generation process of the target image. We demon- source-target image pair example [46, 47]. With the advent
strate that our approach is not only applicable in cases of unconditional image diffusion models, several methods
where the guidance image is generated from text, but also have been proposed to adopt or extend them for various
for real-world images that are inverted into the model. I2I tasks [39, 48]. In this paper, we consider the task of
text-guided image-to-image translation where the target
Our approach of operating in the space of diffusion fea- domain is not specified through a dataset of images but
tures is related to Prompt-to-Prompt (P2P) [16], which re- rather via a target text prompt. Our method is zero-shot,
cently observed that by manipulating the cross-attention does not require training and is applicable to versatile I2I
layers, it is possible to control the relation between the spa- tasks.
tial layout of the image to each word in the text. We demon-
strate that fine-grained control over the generated layout is
difficult to achieve solely from the interaction with a text. Text-guided image manipulation. With the tremendous
Intuitively, since the cross attention is formed by the associ- progress in language-vision models, a surge of methods
ation of spatial features to words, it allows to capture rough have been proposed to perform various types of text-driven
regions at the object level, yet localized spatial information image edits. Various methods have proposed to combine
that is not expressed in the source text prompt (e.g., ob- CLIP [33], which provides a rich and powerful joint image-
ject parts) is not guaranteed to be preserved by P2P. Instead, text embedding space, with a pre-trained unconditional im-
our method focuses only on spatial features and their self- age generator, e.g., a GAN [6, 14, 26, 31] or a diffusion
affinities – we show that such features exhibit high granu- model [2, 22, 25]. For example, DiffusionCLIP [22] uses
larity of spatial information, allowing us to control the gen- CLIP to fine-tune a diffusion model to perform text guided
erated structure, while not restricting the interaction with manipulations. Concurrent to our work, [25] uses CLIP and
the text. Our method outperforms P2P in terms of structure semantic losses of [46] to guide a diffusion process to per-
preservation and is superior in working with real guidance form I2I translation. Aiming to edit the appearance of ob-
images. jects in real-world images, Text2LIVE [4] trains a gener-
ator on a single image-text pair, without additional train-
To summarize, we make the following key contributions:
ing data; thus, avoiding the trade-off, inherent to pre-trained
(i) We provide new empirical insights about internal spatial
generators, between satisfying the target edit and maintain-
features formed during the diffusion process.
ing high-fidelity to the original content. While these meth-
(ii) We introduce an effective framework that leverages the ods have demonstrated impressive text-guided semantic ed-
power of pre-trained and fixed guided diffusion, allowing its, there is still a gap between the generative prior that is
to perform high-quality text-guided I2I translation without learned solely from visual data (typically on specific do-
any training or fine-tuning. mains or ImageNet data), and the rich CLIP text-image
(iii) We show, both quantitatively and qualitatively that guiding signal that has been learned from much broader and

2
“A photo of a
statue in the snow”
DDIM
Inversion

Original denoised
Input image Feature and
self-attention
Injection
Residual Self Cross
Block Attention Attention

“A photo of a statue
in the snow” Injected features by our method

Input edit text

Translated denoised

(a) Plug-and-Play Diffusion Features (b) U-Net Layer

Figure 2. Plug-and-play Diffusion Features. (a) Our framework takes as input a guidance image and a text prompt describing the desired
translation; the guidance image is inverted to initial noise xG T , which is then progressively denoised using DDIM sampling. During this
process, we extract (f lt , q lt , klt ) – spatial features from the decoder layers and their self-attention, as illustrated in (b). To generate our
text-guided translated image, we fix x∗T = xG l l l
T and inject the guidance features (f t , q t , kt ) at certain layers, as discussed in Sec. 4.

richer data. Recently, text-to-image generative models have out and fulfilling the target text. We demonstrate that our
closed this gap by directly conditioning image generation method significantly outperforms SDEdit, providing better
on text during training [12, 28, 34, 36, 40]. These models balance between these two ends.
have demonstrated unprecedented capabilities in generating
high-quality and diverse images from text, capturing com- 3. Preliminary
plex visual concepts (e.g., object interactions, geometry, or
composition). Nevertheless, such models offer little con- Diffusion models [11, 18, 36, 43] are probabilistic gen-
trol over the generated content. This creates a great interest erative models in which an image is generated by progres-
in developing methods to adopt such unconstrained text-to- sively removing noise from an initial Gaussian noise image,
image models for controlled content creation. xT ∼ N (0, I). These models are founded on two comple-
mentary random processes. the forward process, in which
Several concurrent methods have taken first steps in this
Gaussian noise is progressively added to a clean image, x0 :
direction, aiming to influence different properties of the
generated content [13, 21, 38, 48]. DreamBooth [38] and √ √
xt = αt · x0 + 1 − αt · z (1)
Textual Inversion [13] share the same high-level goal of
“personalizing” a pre-trained text-to-image diffusion model where z ∼ N (0, I) and {αt } are the noise schedule.
given a few user-provided images. Our method also lever-
The backward process is aimed at gradually denoising
ages a pre-trained text-to-image diffusion model to achieve
xT , where at each step a cleaner image is obtained. This
our goal, yet does not involve any training or fine-tuning.
process is achieved by a neural network θ (xt , t) that pre-
Instead, we devise a simple framework that intervenes in
dicts the added noise z. Once trained, each step of the
the generation process by directly manipulating the spatial
backward process consists of applying θ to the current xt ,
features.
and adding a Gaussian noise perturbation to obtain a cleaner
As discussed in Sec. 1, our methodological approach is xt−1 .
related to Prompt-to-Prompt [16], yet our method offers Diffusion models are rapidly evolving and have been ex-
several key advantages: (i) enables fine-grained control over tended and trained to progressively generate images condi-
the generated shape and layout, (ii) allows to use arbitrary tioned on a guiding signal θ (xt , y, t), e.g., conditioning
text-prompts to express the target translation; in contrast to the generation on another image [39], class label [19], or
P2P that requires word-to-word alignment between a source text [22, 28, 34, 36].
and target text prompts, (iii) demonstrates superior perfor- In this work, we leverage a pre-trained text-conditioned
mance of real-world guidance images. Latent Diffusion Model (LDM), a.k.a Stable Diffusion [36],
Lastly, SDEdit [27] is another method that applies ed- in which the diffusion process is applied in the latent space
its on user provided images using free text prompts. Their of a pre-trained image autoencoder. The model is based on a
method noises the guidance image to an intermediate dif- U-Net architecture [37] conditioned on the guiding prompt
fusion step, and then denoises it conditioned on the input P . Layers of the U-Net comprise a residual block, a self-
prompt. This simple approach leads to impressive results, attention block, and a cross-attention block, as illustrated in
yet exhibit a tradeoff between preserving the guidance lay- Fig. 2 (b). The residual block convolve image features φtl−1

3
Generated images Real images
Input
layer=1
layer=4
layer=7
layer=11

Figure 3. Visualising diffusion features. We used a collection of 20 humanoid images (real and generated), and extracted spatial features
from different decoder layers, at roughly 50% of the generation process (t = 540). For each block, we applied PCA on the extracted features
across all images and visualized the top three leading components. Intermediate features (layer 4) reveal semantic regions (e.g., legs or
torso) that are shared across all images, under large variations in object appearance and the domain of images. Deeper features capture
more high-frequency information, which eventually forms the output noise predicted by the model. See SM for additional visualizations.

from the previous layer l−1 to produce intermediate features Input generation time
Generated

f lt . In the self-attention block, features are projected into

queries, q lt , keys, klt , and values, v lt , and the output of the
block is given by:
l T

fˆt = Alt v lt where Alt = Softmax q lt klt (2)
Real

This operation allows for long-range interactions between

image features. Finally, cross-attention is computed be-
tween the spatial image features and the token embedding Figure 4. Diffusion features over generation time-steps. Visualiz-
of the text prompt P . ing PCA of spatial features of layer l = 4 for the humanoid images
(Fig. 3). Semantic parts are shared (have similar colors) across
4. Method images at each time step.

Given an input guidance image I G and a target prompt representing the affinities between the spatial features, al-
P , our goal is to generate a new image I ∗ that complies lows to retain fine layout and shape details.
with P and preserves the structure and semantic layout of
Based on our findings, we devise a simple framework
I G . We consider StableDiffusion [36], a state-of-the-art
that extracts features from the generation process of the
pre-trained and fixed text-to-image LDM model, denoted by
guidance image I G and directly injects them along with P
θ (xt , P, t). This model is based on a U-Net architecture,
into the generation process of I ∗ , requiring no training or
as illustrated in Fig. 2 and discussed in Sec. 3.
fine-tuning (Fig. 2). Our approach is applicable for both
Our key finding is that fine-grained control over the gen-
text-generated and real-world guidance images, for which
erated structure can be achieved by manipulating spatial
we apply DDIM inversion [44] to get the initial xG T.
features inside the model during the generation process.
Specifically, we observe and empirically demonstrate that:
(i) spatial features extracted from intermediate decoder lay- Spatial features. In text-to-image generation, one can use
ers encode localized semantic information and are less af- descriptive text prompts to specify various scene and object
fected by appearance information, and (ii) the self-attention, proprieties, including those related to their shape, pose and

4
Source image Layer 4 Layers 4-8 Layers 4-11 Input image layer=4 layer=8 layer=11

“a photo of a silver (a) Feature injection, no attention injection

robot in the snow”
Figure 6. Self-attention visualization. Showing 3 leading compo-
nents of the self-attention matrix Alt computed for the input image
for three different layers. The principal components are aligned
with the layout of the image: similar regions share similar colors.
Note how all pixels of the pants share similar color, despite their
(b) Features Layer 4 + Attention injection different appearance in the input image.

and a single time step. As seen, the coarsest and shallowest

layer is mostly dominated by foreground-background sep-
aration, depicting only a crude blob in the location of the
foreground object. Interestingly, we can observe that the
(c) Attention injection, no feature injection
intermediate features (layer 4) encode localized semantic
Figure 5. Ablating features and attention injection. (a) Features information shared across objects from different domains
extracted from the guidance image (left) are injected into the gen- and under significant appearance variations – similar object
eration process of the translated image (guided by a given text parts (e.g., legs, torso, head) are depicted in similar colors
prompt). While features at intermediate layers (Layer 4) exhibit across all images (layer=4 row in Fig. 3). These proper-
localized semantic information (Fig. 3), solely injecting these fea- ties are consistent across the generation process as shown
tures is insufficient for retaining the guidance structure. Incorpo- in Fig. 4. As we go deeper into the network, the features
rating deeper (and higher resolution) features leads to better struc- gradually capture more high-frequency low-level informa-
ture preservation, but results in appearance leakage from the guid- tion which eventually forms the output noise predicted by
ance image to the generated one (Layers 4-11). (b) Injecting fea-
the network. Extended feature visualizations can be found
tures only at layer 4 and self-attention maps at higher-resolution
layers alleviates this issue. (c) Injecting only self-attention maps in the Supplementary Materials (SM) on our website.
restricts the affinities between the features, yet there is no semantic
association between the guidance features and the generated ones, Feature injection. Based on these observations, we now
resulting in misaligned structure. The result of our final configu- turn to the translation task. Let xG T be the initial noise, ob-
ration is highlighted in orange. tained by inverting I G using DDIM [44].
Given the target prompt P , the generation of the trans-
scene layout, e.g., “a photo of a horse galloping in the for- lated image I ∗ is carried with the same initial noise, i.e.,
est”. However, the exact scene layout, the shape of the ob- x∗T = xG T ; we refer the reader to Appendix B for an analy-
ject and its fine-grained pose often significantly vary across sis and justification of this design choice.
generated images from the same prompt under different ini- At each step t of the backward process, we extract the
tial noise xT . This suggests that the diffusion process it- guidance features {f lt } from the denoising step: z G t−1 =
self and the resulting spatial features have a role in form- θ (xG , ∅, t). 1
These features are then injected into the gen-
t
ing such fine-grained spatial information. This hypothesis eration of I ∗ , i.e., in the denoising step of x∗t , we override
is strengthened by [5], which demonstrated that semantic
the resulting features {f ∗l l
t } with {f t }. This operation is
part segments can be estimated from spatial features in an
expressed by:
unconditional diffusion model.
We opt to gain a better understanding of how such se- z ∗t−1 = ˆθ (x∗t , P, t ; {f lt }) (3)
mantic spatial information is internally encoded in θ . To
this end, we perform a simple PCA analysis which allows where we use ˆθ (· ; {f lt }) to denote the modified denoising
us to reason about the visual properties dominating the high-
dimensional features in θ . Specifically, we generated a di- step with the injected features {f lt }. In case of no injection,
verse set of images containing various humanoids in differ- ˆθ (xt , P, t ; ∅) = θ (xt , P, t).
ent styles, including both real and text-generated images; Fig. 5(a) shows the effect of injecting spatial features f lt
sample images are shown in Fig. 3. For each image, we ex- at increasing layers l. As seen, injecting features only at
tract features f lt from each layer of the decoder at each time layer l = 4 is insufficient for preserving the structure of the
guidance image. As we inject features in deeper layers, the
step t, as illustrated in Fig. 2(b). We then apply PCA on f lt
structure is better preserved, yet appearance information is
across all images.
Fig. 3 shows the first three principal components for a 1 In the case of a generated guidance image, z G G
t−1 = θ (xt , PG , t),
representative subset of the images across different layers where PG is the text used to generate I G .

5
leaked into the generated image (e.g., shades of the red t- Algorithm 1 Plug-and-Play Diffusion Features
shirt and blue jeans are apparent in Layer 4-11). To achieve Inputs:
a better balance between preserving the structure of I G and I G . real guidance image
deviating from its appearance, we do not modify spatial fea- P . target text prompt
tures at deep layers, but rather leverage the self-attention τf , τA . injection thresholds
layers as discussed below.
xGT ← DDIM-inv(I )
G
∗ G
Self-attention. Self-attention modules compute the affini- xT ← xT . Starting from same seed
ties Alt between the spatial features after linearly project- for t ← T . . .n 1 doo
4 l

ing them into queries and keys. These affinities have a zG G
t−1 , f t , At ← θ xt , ∅, t
tight connection to the established concept of self-similarly,
xG G
t−1 ← DDIM-samp xt , z t−1
G
which has been used to design structure descriptors by both
classical and modern works [3,24,41,46]. This motivates us if t > τf then f ∗4 4 ∗4
t ← f t else f t ← ∅
∗l l ∗l
to consider the attention matrices Alt to achieve fine-grained if t > τA then At ← At n else Aot ←∅
control over the generated content. ∗ ∗ ∗4 ∗l
z t−1 ← ˆθ xt , P, t ; f t , At
Fig. 6, shows the leading principal components of a ma-
x∗t−1 ← DDIM-samp x∗t , z ∗t−1

trix Alt for a given image. As seen, in early layers, the at- end for
tention is aligned with the semantic layout of the image, Output: I ∗ ← x∗0
grouping regions according to semantic parts. Gradually,
higher-frequency information is captured.
Practically, injecting the self-attention matrix is done by
replacing the matrix Alt in Eq. 2. Intuitively, this operation from θ (xt , Pn , t). For example, using Pn that describes
pulls features close together, according to the affinities en- the guidance image, we can steer the denoised image away
from the original content.
coded in Alt . We denote this additional operation by modi-
fying Eq. (3) as follows: We use a parameter α ∈ [0, 1] to balance between neutral
and negative prompting:
z ∗t−1 = ˆθ (xt , P, t; f 4t , {Alt }) (4)
˜ = αθ (xt , ∅, t) + (1 − α)θ (xt , Pn , t) (6)
Fig. 5(b) shows the effect of Eq. (4) for increasing injec-
tion layers; the maximal injection layer of Alt controls the We plug ˜ instead of θ (xt , ∅, t) in Eq. (5). That is, =
level of fidelity to the original structure, while mitigating wθ (xt , P, t) + (1 − w)˜.
the issue of appearance leakage. Fig. 5(c) demonstrates In practice, we find negative-prompting to be beneficial
the pivotal role of the features f 4t . As seen, with only for handling textureless “primitives” guidance images (e.g.,
self-attention, i.e., z ∗t−1 = ˆθ (xt , P, t; {Alt }), there is no silhouette images). For natural-looking guidance images, it
semantic association between the original content and the plays a minor role. See Appendix A.1 for more details.
translated one, resulting in large deviations in structure.
Our plug-and-play diffusion features framework is sum- 5. Results
marized in Alg. 1, and is controlled by two parameters:
We thoroughly evaluate our method both quantitatively
(i) τf defines the sampling step t until which f 4t are in-
and qualitatively on diverse guidance image domains, both
jected. (ii) τA is the sampling step until which Alt are in- real and generated ones, as discussed below. Please see Ap-
jected. In all our results, we use a default setting where pendix C for full implementation details of our method.
self-attention is injected into all the decoder layers. The ex-
act settings of the parameters are discussed in Sec. 5.
Datasets. Our method supports versatile text-guided
image-to-image translation tasks and can be applied to arbi-
Negative-prompting. In classifier-free guidance [20], the
trary image domains. Since there is no existing benchmark
predicted noise at each sampling step is given by:
for such diverse settings, we created two new datasets: (i)
= wθ (xt , P, t) + (1−w) θ (xt , ∅, t) (5) Wild-TI2I, comprises of 148 diverse text-image pairs, 53%
of which consists of real guidance images that we gathered
where w > 1 is the guidance strength. That is, from the Web; (ii) ImageNet-R-TI2I, a benchmark we de-
is being extrapolated towards the conditional prediction rived from the ImageNet-R dataset [15], which comprises
θ (xt , P, t) and pushed away from the unconditional one of various renditions (e.g., paintings, embroidery, etc.) of
θ (xt , ∅, t). This increases the fidelity of the denoised ImageNet object classes. To adopt this dataset for our pur-
image to the prompt P , while allowing to deviate from pose, we manually selected 3 high-quality images from 10
θ (xt , ∅, t). Similarly, by replacing the empty prompt in different classes. To generate our image-text examples, we
Eq. (5) with a “negative” prompt Pn , we can push away created a list of text templates by defining for each source

6
“a photo of a “a photo of a “a photo of a
Guidance golden robot” wooden statue” sand sculpture” benchmark, for which we automatically created aligned
source-target prompts using the labels provided for the ren-
ditions and object categories. We further include quali-
tative comparison to a subset of Wild-TI2I for which the
source and target prompts are aligned. For evaluating P2P
on real guidance images, we applied DDIM inversion with
“a photo of a the source text as in [16].
golden sculpture “a photo of a “a photo of a
in a temple” wooden statue” silver robot”
Fig. 8 shows sample results of our method compared
with the baselines. As seen, our method successfully trans-
Wild TI2I-Real

lates diverse inputs, and works well for both real and gener-
ated guidance images. In all cases, our results exhibit both
high preservation to the guidance layout and high fidelity to
the target prompt. This is in contrast to SDEdit that suffers
“a photo of a “a photo of a “a photo of an
sparrow” hummingbird in eagle” from an inherent tradeoff between the two – with low noise
a forest”
level, the guidance structure is well preserved but in the ex-
panse of hardly changing the appearance; larger deviation
in appearances can be achieved with higher noise level, yet
the structure is damaged. VQGAN+CLIP exhibits the same
behavior, with overall lower image quality. Similarly, Dif-
“a sculpture of “an image of a “a sketch of a fuseIT shows high fidelity to the guiding shape, with little
a hummingbird” hummingbird” parrot”
changes to the appearance.
In comparison to P2P, it can be seen that their results on
generated guidance images (first 3 rows) depict high fidelity
ImageNet-R-TI2I

to the target prompt, yet only rough preservation of layout,

e.g., results in different number ducks (first row), or devia-
“an embroidery “a photo of a
tion from the mouse shape (second row). Furthermore, their
“a cartoon of a
of a pink jeep ” hippie colorful
jeep” method struggles to deviate from the guidance appearance
jeep””

and satisfy the target edit when it is applied to real images

(4-8 rows). We speculate that the reason is that DDIM in-
version in their case is applied with a source text, requir-
ing using low guidance scale at sampling. In contrast, our
method performs DDIM inversion with an empty prompt,
allowing us to use arbitrary guidance scale or prompts at
Figure 7. Sample results of our method on image-text pairs from generation.
Wild-TI2I and ImageNet-R-TI2I benchmarks.
We numerically evaluate these results using two com-
plementary metrics: text-image CLIP cosine similarity to
class target categories and styles, and automatically sam- quantify how well the generated images comply with the
pled their combinations. This results in total of 150 image- text prompt (higher is better), and distance between DINO-
text pairs. See Appendix D for full details. ViT self-similarity [46], to quantify the extent of structure
Figs. 1 and 7 show a sample of our results on both real preservation (lower is better).
and generated guidance images. Our results show both ad- As seen in Fig. 9, our method outperforms the baselines
herence to the guidance shape and compliance with differ- by achieving both high preservation of structure (in par with
ent target prompts. Our method successfully handles both SDEdit w/ very low noising level), and high fidelity to the
naturally looking as well as artistic and textureless guidance target text (in par with SDEdit w/ very high noising level).
images. We note that VQGAN-CLIP and DiffuseIT directly use the
evaluation metrics as their objective (CLIP loss in [9] and
5.1. Comparison to Prior/Concurrent Work DINO self-similarity in [25]), which explains their respec-
We focus our comparisons on state-of-the-art baselines tive scores in these metrics.
that can be applied to diverse text-guided I2I tasks, in-
cluding: (i) SDEdit [27] under three different noising lev- Extended comparison to P2P [16] To factor out the ef-
els, (ii) P2P [16], (iii) DiffuseIT [25], and (iv) VQGAN- fect of DDIM inversion, we expand our comparison to
CLIP [9]. We further provide qualitative comparisons to P2P on generated guidance images. Specifically, we cre-
Text2LIVE [4], FlexIT [8] and DiffusionCLIP [22]. ated a generated-ImageNet-R-TI2I benchmark by using text
We note that P2P requires a source prompt that is word- prompts expressing the same object classes and renditions
aligned to the target prompt. Thus, we include a qualitative described in Appendix D.1.
and quantitative comparison to P2P on our ImageNet-R-TI2I As seen in Fig. 10, both our method and P2P comply

7
Source image Ours P2P DiffuseIT SDEdit .6 SDEdit .75 SDEdit .85 VQ+CLIP

“a photo of
rubber ducks
walking on
street”
Wild TI2I-Generated

“a photo of a
golden robot”

“a photo of a
car made of ice
on the grass”

“a photo of
a white
wedding cake”
Wild TI2I-Real

“a photo of a silver
robot walking
on the moon”

“a photo of
a pizza”
ImageNet-R-TI2I

“an origami
of a panda”

“a tattoo of
a hummingbird”

Figure 8. Comparisons. Sample results are shown for each of the two benchmarks: ImageNet-R-TI2I and Wild-TI2I, which includes
both real and generated guidance images. From left to right: the guidance image and text prompt, our results, P2P [16], DiffuseIT [25],
SDedit [27] with 3 different noising levels, and VQ+CLIP [9].

with the target text. However, P2P often results in large Additional baselines. Fig. 11 shows qualitative compar-
deviations from the guidance structure, especially in cases isons with: (i) Text2LIVE [4], (ii) DiffusionCLIP [22], and
where multiple prompts edits are applied (last two rows in (iii) FlexIT [8]. These methods either fail to deviate from
Fig. 10). Our method demonstrates fine-grained structure the guidance image or result in noticeable visual artifacts.
preservation across all these examples, while successfully Since Text2LIVE is designed for layered textural editing,
translating multiple traits (e.g., both category and style). thus it can only “paint” over the guidance image and cannot
These properties are strongly evident by Fig. 9, where our apply any structural changes necessary to convey the target
method results in a significantly lower self-similarity dis- edit (e.g. dog to venom, church to Asian tower). More-
tance, even compared to P2P with injecting cross-attention over, Text2LIVE does not leverage a strong generative prior,
at all timesteps (t = 1000). hence often results in low visual quality. FlexIT often fails

8
Generated images
Real images (a) Wild-TI2I (b) ImageNet-R-TI2I (c) Generated ImageNet-R-TI2I
VQGAN P2P (t=500)
Ours -CLIP SDEdit (n=0.85) VQGAN
SDEdit (n=0.85) VQGAN

Structure Distance (DINO self-similarity)

(w/o self-att.) SDEdit (n=0.85) -CLIP
Structure Distance (DINO self-similarity)

Structure Distance (DINO self-similarity)

-CLIP P2P (t=1000)
Ours
(w/o features)
SDEdit (n=0.75)
SDEdit (n=0.75) SDEdit (n=0.75)

P2P
Ours (full)
Ours
SDEdit (n=0.6) SDEdit (n=0.6) Ours
SDEdit (n=0.6)
DiffuseIT
DiffuseIT Better Better Better
DiffuseIT

Text-Image Similarity (CLIP cosine similarity) Text-Image Similarity (CLIP cosine similarity) Text-Image Similarity (CLIP cosine similarity)

Figure 9. Quantitative evaluation. We measure CLIP cosine similarity (higher is better) and DINO-ViT self-similarity distance (lower is
better) to quantify the fidelity to text and preservation of structure, respectively. We report these metrics on three benchmarks: (a) Wild-TI2I
for which an ablation of our method is included, (b) ImageNet-R-TI2I, and (c) Generated-ImageNet-R-TI2I. Note that we could compare
to P2P only for (b) and (c) due to their prompts restriction. All baselines struggle to achieve both low structure distance and a high CLIP
score. Our method exhibit a better balance between these two ends across all benchmarks.

Guidance Ours P2P

fusionCLIP requires fine-tuning an unconditional diffusion
model for each target edit on a set of 30+ images from a
“a “an
graphic origami single domain (e.g. dog faces, churches).
of a of a
penguin penguin
” ” 5.2. Ablation
We ablate our key design choices by evaluating our per-
“an art
“a formance for the following cases: (i) w/o spatial features in-
photo
of a of a jection (w/o features), (ii) w/o self-attention injection. The
cat” cat”
metrics are reported in Fig. 9(a) and a representative exam-
ple is shown in Fig. 5. The results demonstrate that both fea-
tures and self-attention are critical for structure preservation
“a carto “a tatto – the features provide a semantic association between the
on of a of a
husky” husky” original and translated content, while self-attention is es-
sential for maintaining this association and capturing finer
structural information. Further ablations can be found in
Appendix A and Tab. 1.
“an “a
origami photo
of a of a
husky” poodle” 6. Discussion and Conclusion
We presented a new framework for diverse text-guided
“a “an image-to-image translation, founded on new insights about
painting image
of of the internal representation of a pre-trained text-to-image
a panda” a
bobcat” diffusion model. Our method, based on simple manipula-
tion of features, outperforms existing baselines, achieving
a significantly better balance between preserving the guid-
Figure 10. Comparison to P2P on generated ImageNet-R-TI2I ance layout and deviating from its appearance. As for lim-
benchmark. While P2P results demonstrate high fidelity to the the itations, our method relies on the semantic association be-
target text, there are noticeable deviation from the guidance struc- tween the original and translated content in the diffusion
ture, especially in cases of multiple word swaps (last two rows). feature space. Thus, it does not work well on detailed la-
Across all examples, our results adhere to the target edit while bel segmentation masks where regions are colored arbitrar-
preserving the guidance scene layout and object pose. ily (Fig. 12). In addition, we are relying on DDIM inver-
sion, which we found to work well in most of our examples.
to deviate from the guidance content, which may be caused Nevertheless, we observed that for textureless “minimal”
to their regularization that encourages the guidance and out- images, DDIM may occasionally result in a latent that en-
put images to match in LPIPS sense. We also note that Dif- codes dominant low-frequency appearance information, in

9
Guidance Text2Live DiffusionCLIP FlexIT Ours References
“a photo
of [1] Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. Restyle:
Venom”
A residual-based stylegan encoder via iterative refinement.
In Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV), 2021. 2
“a photo
of [2] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended
a bear”
diffusion for text-driven editing of natural images. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
“a photo and Pattern Recognition, 2022. 2
of a
yorkshire [3] Shai Bagon, Ori Brostovski, Meirav Galun, and Michal Irani.
terrier”
Detecting and sketching the common. In 2010 IEEE Com-
puter Society Conference on Computer Vision and Pattern
“a photo
of an Recognition, 2010. 6
ancient
Asian [4] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kas-
tower”
ten, and Tali Dekel. Text2live: Text-driven layered image
and video editing. In European Conference on Computer
“a photo
of a Vision. Springer, 2022. 2, 7, 8, 10
wooden
house ” [5] Dmitry Baranchuk, Andrey Voynov, Ivan Rubachev,
Valentin Khrulkov, and Artem Babenko. Label-efficient se-
“a photo mantic segmentation with diffusion models. In International
of a Conference on Learning Representations, 2022. 5
golden
church” [6] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park,
Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by
word. arXiv preprint arXiv:2103.10951, 2021. 2
Figure 11. Qualitative comparisons to additional baselines:
[7] Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and
Text2LIVE [4], DiffusionCLIP [22], FlexIT [8]. These methods
Shi-Min Hu. Sketch2photo: internet image montage. ACM
fail to deviate from the structure for matching the target prompt,
Trans. Graph., 2009. 2
or create undesired artifacts.
[8] Guillaume Couairon, Asya Grechka, Jakob Verbeek, Hol-
ger Schwenk, and Matthieu Cord. FlexIT: Towards flexi-
ble semantic image translation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), 2022. 7, 8, 10
[9] Katherine Crowson, Stella Biderman, Daniel Kornis,
Guidance Image Ours Guidance Image Ours
Dashiell Stander, Eric Hallahan, Louis Castricato, and Ed-
ward Raff. Vqgan-clip: Open domain image generation and
Figure 12. Limitations. Our method fails when there is no seman-
editing with natural language guidance. In European Con-
tic association between the guidance content and the target text.
ference on Computer Vision, pages 88–105. Springer, 2022.
Thus, it does not perform well on solid segmentation masks with
7, 8
arbitrary colors.
[10] Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, and
William T Freeman. Sparse, smart contours to represent and
which case some appearance information would leak into edit images. In Proceedings of the IEEE Conference on Com-
our results. We believe that our work demonstrates the yet puter Vision and Pattern Recognition, 2018. 2
unrealized potential of the rich and powerful feature space [11] Prafulla Dhariwal and Alexander Nichol. Diffusion models
spanned by pre-trained text-to-image diffusion models. We beat gans on image synthesis. Advances in Neural Informa-
hope it will motivate future research in this direction. tion Processing Systems, 2021. 3
[12] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin,
Acknowledgments: We thank Omer Bar-Tal for his in- Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-
sightful comments and discussion. This project re- based text-to-image generation with human priors. In Eu-
ceived funding from the Israeli Science Foundation (grant ropean Conference on Computer Vision (ECCV), 2022. 1, 2,
3
2303/20), the Carolito Stiftung, and the NVIDIA Applied
Research Accelerator Program. Dr. Bagon is a Robin [13] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patash-
nik, Amit H Bermano, Gal Chechik, and Daniel Cohen-
Chemers Neustein AI Fellow.
Or. An image is worth one word: Personalizing text-to-
image generation using textual inversion. arXiv preprint
arXiv:2208.01618, 2022. 3
[14] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and
Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adap-
tation of image generators. ACM Transactions on Graphics
(TOG), 2022. 2

10
[15] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- translation. In European Conference on Computer Vision,
vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, 2020. 2
Samyak Parajuli, Mike Guo, et al. The many faces of robust- [30] Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli
ness: A critical analysis of out-of-distribution generalization. Shechtman, Alexei Efros, and Richard Zhang. Swapping au-
In ICCV, 2021. 6 toencoder for deep image manipulation. Advances in Neural
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Information Processing Systems, 2020. 2
Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt im- [31] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or,
age editing with cross attention control. arXiv preprint and Dani Lischinski. Styleclip: Text-driven manipulation of
arXiv:2208.01626, 2022. 2, 3, 7, 8 stylegan imagery. In Proceedings of the IEEE/CVF Interna-
[17] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian tional Conference on Computer Vision (ICCV), 2021. 2
Curless, and David Salesin. Image analogies. In Lynn [32] Lara Raad and Bruno Galerne. Efros and freeman image
Pocock, editor, ACM Trans. on Graphics (Proceedings of quilting algorithm for texture synthesis. Image Process. Line,
ACM SIGGRAPH), 2001. 2 2017. 2
[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
sion probabilistic models. Advances in Neural Information Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Processing Systems, 2020. 3 Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
[19] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Krueger, and Ilya Sutskever. Learning transferable visual
Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models from natural language supervision. In Proceedings
models for high fidelity image generation. J. Mach. Learn. the International Conference on Machine Learning (ICML),
Res., 23:47–1, 2022. 3 2021. 2
[20] Jonathan Ho and Tim Salimans. Classifier-free diffusion [34] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
guidance. In NeurIPS 2021 Workshop on Deep Generative and Mark Chen. Hierarchical text-conditional image gen-
Models and Downstream Applications, 2021. 6 eration with clip latents. arXiv preprint arXiv:2204.06125,
[21] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen 2022. 1, 2, 3
Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: [35] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan,
Text-based real image editing with diffusion models. arXiv Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding
preprint arXiv:2210.09276, 2022. 3 in style: a stylegan encoder for image-to-image translation.
[22] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Dif- In Proceedings of the IEEE/CVF conference on computer vi-
fusionclip: Text-guided diffusion models for robust image sion and pattern recognition, 2021. 2
manipulation. In Proceedings of the IEEE/CVF Conference [36] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
on Computer Vision and Pattern Recognition, 2022. 2, 3, 7, Patrick Esser, and Björn Ommer. High-resolution image syn-
8, 10 thesis with latent diffusion models, 2021. 1, 2, 3, 4
[23] Kunhee Kim, Sanghun Park, Eunyeong Jeon, Taehun Kim, [37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
and Daijin Kim. A style-aware discriminator for controllable net: Convolutional networks for biomedical image segmen-
image translation. In Proceedings of the IEEE/CVF Confer- tation. In International Conference on Medical image com-
ence on Computer Vision and Pattern Recognition (CVPR), puting and computer-assisted intervention, 2015. 3
2022. 2 [38] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,
[24] Nicholas Kolkin, Jason Salavon, and Gregory Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine
Shakhnarovich. Style transfer by relaxed optimal transport tuning text-to-image diffusion models for subject-driven
and self-similarity. In Proceedings of the IEEE/CVF Con- generation. arXiv preprint arXiv:2208.12242, 2022. 3
ference on Computer Vision and Pattern Recognition, 2019. [39] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee,
6 Jonathan Ho, Tim Salimans, David Fleet, and Mohammad
[25] Gihyun Kwon and Jong Chul Ye. Diffusion-based image Norouzi. Palette: Image-to-image diffusion models. In ACM
translation using disentangled style and content representa- SIGGRAPH 2022 Conference Proceedings, 2022. 2, 3
tion. arXiv preprint arXiv:2209.15264, 2022. 2, 7, 8 [40] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
[26] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed
Zhang, Hao Su, and Qiang Liu. Fusedream: Training-free Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi,
text-to-image generation with improved clip+ gan space op- Rapha Gontijo Lopes, et al. Photorealistic text-to-image
timization. arXiv preprint arXiv:2112.01573, 2021. 2 diffusion models with deep language understanding. arXiv
[27] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun- preprint arXiv:2205.11487, 2022. 1, 3
Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and [41] Eli Shechtman and Michal Irani. Matching local self-
editing with stochastic differential equations. arXiv preprint similarities across images and videos. In IEEE Conference
arXiv:2108.01073, 2021. 3, 7, 8 on Computer Vision and Pattern Recognition, 2007. 6
[28] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav [42] Yichang Shih, Sylvain Paris, Frédo Durand, and William T.
Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Freeman. Data-driven hallucination of different times of day
Mark Chen. Glide: Towards photorealistic image generation from a single outdoor photo. ACM Trans. Graph., 2013. 2
and editing with text-guided diffusion models. arXiv preprint [43] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
arXiv:2112.10741, 2021. 2, 3 and Surya Ganguli. Deep unsupervised learning using
[29] Taesung Park, Alexei A. Efros, Richard Zhang, and Jun- nonequilibrium thermodynamics. In International Confer-
Yan Zhu. Contrastive learning for unpaired image-to-image ence on Machine Learning. PMLR, 2015. 3

11
[44] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
ing diffusion implicit models. In International Conference
on Learning Representations, 2020. 4, 5
[45] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and
Daniel Cohen-Or. Designing an encoder for stylegan im-
age manipulation. ACM Transactions on Graphics (TOG),
40(4):1–14, 2021. 2
[46] Narek Tumanyan, Omer Bar-Tal, Shai Bagon, and Tali
Dekel. Splicing vit features for semantic appearance transfer.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2022. 2, 6, 7
[47] Yael Vinker, Eliahu Horwitz, Nir Zabari, and Yedid Hoshen.
Image shape manipulation from a single augmented training
sample. In Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pages 13769–13778, 2021. 2
[48] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong
Chen, Qifeng Chen, and Fang Wen. Pretraining is all you
need for image-to-image translation. In arXiv, 2022. 2, 3
[49] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A
Efros. Unpaired image-to-image translation using cycle-
consistent adversarial networks. In Proceedings of the IEEE
international conference on computer vision, 2017. 2

12
A. Ablations No Negative Negative
Guidance Prompting Prompting
A.1. Negative-prompting.
We qualitatively and quantitatively ablate the effect “a photo
of negative prompting (see Sec. 4 of the main paper). of a
humming
Tab. 1 compares our metrics w/ and w/o negative prompt- -bird”

ing (bottom row and second row), using our Wild-TI2I and
ImageNet-R-TI2I benchmarks. The results indicate that the
usage of negative prompting (bottom row) leads to slightly “a
larger deviation from the guidance image (higher LPIPS photoreal
istic
distance between I G and I ∗ ), while introducing only a mi- image of
a wooden
nor reduction in structure preservation. Sample results of sculpture”
this ablation are shown in Fig. 13, where we can also notice
that negative prompting has a larger effect for “primitive im-
ages”, i.e, simple “textureless” images such as silhouettes
“an
(top two rows) than natural guidance images. image of
Kung Fu
Panda”
A.2. Injected features.
Our method injects features to the decoder block, in a
specific layer which we observed to capture localized se- Figure 13. Qualitative ablation of negative prompting. The effect
mantic information. To complete our analysis, we extend of negative prompting is most meaningful on textureless guidance
our PCA feature visualization to include both the decoder images. In the case of realistic images (row 3) it has a minor effect.
and encoder features. As seen in Fig. 15, the encoder re-
semble a mirrored trend to the decoder: the encoder features
start with high frequency noise (layer 1), which is grad- those generated under the same prompt P with different
ually transformed into cleaner features that depict lower- seeds {xiT }. Specifically, we used 10 different prompts and
frequency content throughout the layers. Nevertheless, lo- 10 different seeds to generate 100 images, using all combi-
calized semantic information is overall less apparent in the nations. We considered the images generated under: (i) the
encoder’s features. To numerically evaluate this, we con- same prompt across different seeds, and (ii) the same seed
sider a modified version of our method where the encoder across different prompts. Total of 20 sets of 10 images each.
features from layer 7, which resemble some semantic in- We then measured the variance between the feature maps
formation, are additionally injected. As seen in Tab. 1, this within each of these sets. In Fig. 14 (top), we report these
combination results in worse CLIP score in all data-sets and variances (averaged across spatial location) as a function
smaller LPIPS deviation from the guidance image on most of the encoder layer l. As seen, changing the initial seed,
sets (first row). for any fixed prompt, results in significantly higher variance
across features for all layers l compared to fixing the seed
B. Initial noise xT and spatial features and changing only the prompt. These findings validate our
hypothesis and support our method’s dependency on the ini-
We observed that in order for our method to work, the tial seed.
initial noise used to generate the translated image x∗T has to
match the initial guidance noise xG T . Since we inject fea-
tures into the decoder from the very first step of the back-
C. Implementation Details
ward process, this dependency on the random seed can only We use Stable Diffusion as our pre-trained text-to-image
be explained by the encoder features at t = T , denoted by model; we use the StableDiffusion-v-1-4 checkpoint pro-
f et l . Recall that these features depend on both x∗T and the vided via official HuggingFace webpage.
target prompt P . This raises the question: why f Tel origi- In all of our experiments, we use DDIM deterministic
nated from x∗T = xG T and an arbitrary text prompt P , allow sampling with 50 steps. In the case of real guidance im-
our method to work? We hypothesize that in t = T the target ages, we perform deterministic DDIM inversion with 1000
prompt has little effect on the encoder features f eTl , thus the forward steps and then perform deterministic DDIM sam-
injected decoder features f 4T comply with the encoder fea- pling with 1000 backward steps. Our translation results are
tures. In contrast, changing the seed results in a mismatch performed with 50 sampling steps, thus we extract features
between f 4T and f Tel . This may be surprising since images only at these steps. We set our default injection thresholds
generated from the same seed under different text prompts to: τA = 25, τf = 40 out of the 50 sampling steps; for
may dramatically differ from one another (see Fig. 14 bot- primitive guidance image, we found that τA = τf = 25 to
tom). work better.
To validate this hypothesis, we performed an analysis During translation, we set the classifier-free guidance
that shows that features formed from the same xT under ar- scale for real and generated guidance images to 15.0 and
bitrary prompts {Pi } are significantly more correlated than 7.5, respectively. The use of negative prompting is con-

13
Wild-TI2I Real Wild-TI2I Generated ImageNet-R-TI2I
Self-Sim ↓ CLIP ↑ LPIPS ↑ Self-Sim ↓ CLIP ↑ LPIPS ↑ Self-Sim ↓ CLIP ↑ LPIPS ↑
w/ encoder-feat-7 0.058 0.280 0.527 0.035 0.264 0.453 0.05 0.273 0.458
w/o neg. prompt 0.052 0.281 0.490 0.033 0.275 0.441 0.048 0.274 0.451
w/o feat. 0.090 0.288 0.584 0.084 0.297 0.633 0.076 0.281 0.534
w/o self-attn. 0.097 0.286 0.597 0.090 0.295 0.657 0.089 0.278 0.564
Our method 0.058 0.282 0.521 0.048 0.289 0.542 0.051 0.275 0.462

Table 1. Quantitative evaluation on WILD-Real benchmark. We evaluate the distance in DINO-ViT self-similarity for structure
preservation, CLIP score for target text faithfulness and LPIPS distance for deviation from the guidance image. We ablate the features
injection, self-attention injection, negative prompting, and additional feature injection in the encoder blocks. We report these scores on our
three text guided I2I benchmarks. The configuration reported in the main paper (encoder features + self-attention injection and negative
prompting) is the best balance between the three metrics across the datasets.

Same prompt, realistic generated images, and for generated images that
diffrent seeds
Different prompt,
are primitive/textureless, we use an exponential scheduler
same seed α(t) = e−6·t .
For running the competitors, we use their official imple-
mentations: Prompt-to-Prompt, DiffuseIT, DiffusionCLIP,
Text2LIVE, FlexIT. For running SDEdit on StableDiffu-
sion, we use the implementation available in the Stable Dif-
fusion official repo. For running VQGAN-CLIP, we used
the publicly available repo.
Input Encoder features Bottleneck

Seed = 30 Seed = 30 Seed = 30 Variance across

“an australian “a leopard, a
“a spotted cow,
a cartoon”
stock horse, an photo,
images
(Encoder layer 7)
D. Benchmarks
oil painting"” zoom out”

Seed = 30 D.1. ImageNet-R-TI2I benchmark.

“a zebra, line
drawing”

Seed = 50 Seed = 60 Seed = 90 To test our method on a wide range of guidance

“a zebra, line “a zebra, line “a zebra, line
drawing” drawing” drawing” images, we turn to Image-Net-R [15], a dataset that
contains various renditions of 200 classes from Image-
Net. We manually select 10 classes: “Castle”, “Cat”,
“Goldfish”, “Hummingbird”, “Husky”, “Jeep”,
Figure 14. Measuring features’ variance: different prompts vs. “Panda”, “Penguin”, “Pizza”, “Violin”. To avoid
different seeds. We consider 10 different seeds and 10 different low-quality images, we manually selected 3 images per
prompts, and generate 100 images using all combinations. We class, totaling 30 guidance images.
extract features from the encoder (which are skipped to the de- Additionally, we automatically created 5 target prompts
coder) for all generated images at t = T , and compute the variance
per image. All the prompts share the same template:
over features originated from: (i) the same prompt across different
seeds, and (ii) the same seed across different prompts. The esti- “rendition of a class”, e.g. “a painting of a
mated variance under each of these settings is plotted in orange jeep”. rendition is one of the existing renditions in
and blue bars, respectively. We observe that although the gener- the real ImageNet-R data-set: “an art”, “a cartoon”,
ated images using (ii) do not exhibit shared visual properties, their “a graphic”, “a deviantart”, “a painting”,
features are correlated (low variance). In contrast, images gen- “a sketch”, “a graffiti”, “an embroidery”,
erated using (i) are more visually similar, yet their features are “an origami”, “a pattern”, “a sculpture”, “a
significantly less correlated. tattoo”, “a toy”, “a video-game”, “a photo”,
“an image”. For two (out of five) target prompts per im-
age, we changed the correct class to another object
trolled via a hyperparameter α used to interpolate between
class randomly sampled from 5 related classes (to avoid
the predicted noise θ (N ) and the predicted noise θ (∅),
completely unreasonable translations such as penguin →
as described in Sec. 4. We set an initial value α0 in the
jeep).
first sampling step, and then gradually decrease it. For
real guidance images, we set α0 = 1.0 and use a linear Overall, our ImageNet-R-TI2I benchmark contains 150
scheduler α(t) = α0 − t. For generated guidance im- image-text pairs: 30 guidance images with 5 target prompts
ages, we set α0 = 0.75. We use the linear scheduler for each.

14
Generated images Real images
Input
layer=1
layer=4
Encoder
layer=7
layer=11
layer=1
layer=4
Decoder
layer=7
layer=11

Figure 15. Visualizing diffusion features for both encoder and decoder. Extending the visualization of Fig. 3 in the main paper to include
features from encoder blocks of the U-Net at time t = 540 (top part).

D.2. Wild TI2I benchmark.

We collected a diverse dataset of 148 text-image pairs,
containing different object classes (people, animals, food,
landscapes) in different renditions (realistic images, draw-
ings, solid masks, sketches and illustrations) with different
levels of semantic details. 53% of the examples consists of
real guidance images that we gathered from the Web, and
the rest are generated from text.
We will publicly release our benchmarks and code for
academic use.

DSM 5 Chart
93% (30)
DSM 5 Chart
2 pages
CTRF
No ratings yet
CTRF
2 pages
Corba Book
100% (1)
Corba Book
286 pages
Documents 5
No ratings yet
Documents 5
5 pages
Text-to-Image Synthesis With Generative Models Methods Datasets Performance Metrics Challenges and Future Direction Basiv
No ratings yet
Text-to-Image Synthesis With Generative Models Methods Datasets Performance Metrics Challenges and Future Direction Basiv
16 pages
Nataniel Ruiz Dreambooth Fine Tuning Text To Image
No ratings yet
Nataniel Ruiz Dreambooth Fine Tuning Text To Image
11 pages
Text-to-Image Synthesis With Generative Models Met
No ratings yet
Text-to-Image Synthesis With Generative Models Met
16 pages
Dream Booth
No ratings yet
Dream Booth
25 pages
Meta
No ratings yet
Meta
17 pages
Ruiz DreamBooth Fine Tuning Text-to-Image Diffusion Models For Subject-Driven Generation CVPR 2023 Paper
No ratings yet
Ruiz DreamBooth Fine Tuning Text-to-Image Diffusion Models For Subject-Driven Generation CVPR 2023 Paper
11 pages
Engproc 20 00016 With Cover
No ratings yet
Engproc 20 00016 With Cover
7 pages
Survey Paper On Text-to-Image Generation
No ratings yet
Survey Paper On Text-to-Image Generation
8 pages
Utilizing Generative AI For Text-To-Image Generation
No ratings yet
Utilizing Generative AI For Text-To-Image Generation
6 pages
Avrahami Blended Diffusion For Text-Driven Editing of Natural Images CVPR 2022 Paper
No ratings yet
Avrahami Blended Diffusion For Text-Driven Editing of Natural Images CVPR 2022 Paper
11 pages
Text-to-Image Generation Using Deep Learning
No ratings yet
Text-to-Image Generation Using Deep Learning
6 pages
An Adaptive Approach To Text To Image
No ratings yet
An Adaptive Approach To Text To Image
5 pages
Dreambooth: Fine Tuning Text-To-Image Diffusion Models For Subject-Driven Generation
No ratings yet
Dreambooth: Fine Tuning Text-To-Image Diffusion Models For Subject-Driven Generation
21 pages
Paper Math
No ratings yet
Paper Math
13 pages
4 - Creating Creative Photomontages or Image Mixing Using Generative Adversarial Networks
No ratings yet
4 - Creating Creative Photomontages or Image Mixing Using Generative Adversarial Networks
9 pages
Dynamic Image Generation From Text Prompt Research Paper-JOT-5135
100% (1)
Dynamic Image Generation From Text Prompt Research Paper-JOT-5135
7 pages
Building A System That Can Generate High
No ratings yet
Building A System That Can Generate High
2 pages
BTP - 6 Sem - Part1
No ratings yet
BTP - 6 Sem - Part1
40 pages
Generating AI Text To Image A Comprehensive Guide
No ratings yet
Generating AI Text To Image A Comprehensive Guide
3 pages
Indian Institute OF Information Technology Allahabad: Text To Image Synthesis
No ratings yet
Indian Institute OF Information Technology Allahabad: Text To Image Synthesis
8 pages
Yayi Final Seminar
No ratings yet
Yayi Final Seminar
19 pages
Final All Correct
No ratings yet
Final All Correct
49 pages
Image-To-Image Translation Methods and Applications
No ratings yet
Image-To-Image Translation Methods and Applications
23 pages
A Survey of AI Text-to-Image and AI Text-to-Video Generators
No ratings yet
A Survey of AI Text-to-Image and AI Text-to-Video Generators
5 pages
32636-Article Text-36704-1-2-20250410
No ratings yet
32636-Article Text-36704-1-2-20250410
9 pages
Mini DALL E 3: Interactive Text To Image by Prompting Large Language Models
No ratings yet
Mini DALL E 3: Interactive Text To Image by Prompting Large Language Models
12 pages
28226-Article Text-32280-1-2-20240324
No ratings yet
28226-Article Text-32280-1-2-20240324
9 pages
Neural Crossbreed: Neural Based Image Metamorphosis: Sanghun Park, Kwanggyoon Seo, Junyong Noh
No ratings yet
Neural Crossbreed: Neural Based Image Metamorphosis: Sanghun Park, Kwanggyoon Seo, Junyong Noh
16 pages
Development and Deployment of A Generative Model-Based Framework For Text To Photorealistic Image Generation
No ratings yet
Development and Deployment of A Generative Model-Based Framework For Text To Photorealistic Image Generation
16 pages
Text To Image Survey
No ratings yet
Text To Image Survey
40 pages
SanjanaSademba 2205348.
No ratings yet
SanjanaSademba 2205348.
8 pages
Ijariie 26613
No ratings yet
Ijariie 26613
5 pages
Ai Art Promptsmodifiers Refine Image Quality
No ratings yet
Ai Art Promptsmodifiers Refine Image Quality
15 pages
Dehouce
No ratings yet
Dehouce
12 pages
Dragondiffusion: Enabling Drag-Style Manipulation On Diffusion Models
No ratings yet
Dragondiffusion: Enabling Drag-Style Manipulation On Diffusion Models
10 pages
Dual Adversarial Inference For Text-to-Image Synthesis
No ratings yet
Dual Adversarial Inference For Text-to-Image Synthesis
20 pages
Mpai05 - Final Document
No ratings yet
Mpai05 - Final Document
40 pages
G: Open-Set Grounded Text-to-Image Generation: Ligen
No ratings yet
G: Open-Set Grounded Text-to-Image Generation: Ligen
21 pages
Diff IT
No ratings yet
Diff IT
22 pages
Text-To-Image Generation Using Generative AI
No ratings yet
Text-To-Image Generation Using Generative AI
5 pages
Deep Learning Based Text To Image Genera
No ratings yet
Deep Learning Based Text To Image Genera
6 pages
Ttoimage Merged
No ratings yet
Ttoimage Merged
57 pages
(ICCV-2023) Expressive Text-to-Image Generation With Rich Text
No ratings yet
(ICCV-2023) Expressive Text-to-Image Generation With Rich Text
29 pages
AI Art in Architecture
No ratings yet
AI Art in Architecture
11 pages
Patashnik Localizing Object-Level Shape Variations With Text-to-Image Diffusion Models ICCV 2023 Paper
No ratings yet
Patashnik Localizing Object-Level Shape Variations With Text-to-Image Diffusion Models ICCV 2023 Paper
11 pages
Background and Literature Review
No ratings yet
Background and Literature Review
7 pages
Background and Literature Review
No ratings yet
Background and Literature Review
17 pages
W Pg#s
No ratings yet
W Pg#s
17 pages
Prompt To Prompt - Preprint
No ratings yet
Prompt To Prompt - Preprint
36 pages
Discriminative Probing and Tuning For Text-To-Image Generation
No ratings yet
Discriminative Probing and Tuning For Text-To-Image Generation
22 pages
【2022】RiFeGAN2 Rich Feature Generation for Text-To-Image Synthesis From Constrained Prior Knowledge
No ratings yet
【2022】RiFeGAN2 Rich Feature Generation for Text-To-Image Synthesis From Constrained Prior Knowledge
14 pages
Ijirt180636 Paper
No ratings yet
Ijirt180636 Paper
6 pages
AI Image Generator PPT-1
No ratings yet
AI Image Generator PPT-1
15 pages
Research Paper of Generating Caption From Image
No ratings yet
Research Paper of Generating Caption From Image
5 pages
Saw Gan
No ratings yet
Saw Gan
11 pages
Image-to-Image Translation: Methods and Applications
No ratings yet
Image-to-Image Translation: Methods and Applications
19 pages
On Text To Image Generator
No ratings yet
On Text To Image Generator
10 pages
Design Guidelines For Prompt Engineering
No ratings yet
Design Guidelines For Prompt Engineering
23 pages
Three Dimensional Computer Graphics: Exploring the Intersection of Vision and Virtual Worlds
From Everand
Three Dimensional Computer Graphics: Exploring the Intersection of Vision and Virtual Worlds
Fouad Sabry
No ratings yet
Automatic Power Switching Mains, Solar, Inverter
No ratings yet
Automatic Power Switching Mains, Solar, Inverter
14 pages
Part 1 Icao by Diogo
No ratings yet
Part 1 Icao by Diogo
8 pages
JBL Bar Studio
No ratings yet
JBL Bar Studio
2 pages
The Geisha Memory 2
No ratings yet
The Geisha Memory 2
25 pages
Government College of Engineering and Technology Jammu
No ratings yet
Government College of Engineering and Technology Jammu
20 pages
Basic Question Bank With Answers and Explanations
No ratings yet
Basic Question Bank With Answers and Explanations
275 pages
Lecture No. 08. Manual Techniques at Shoulder - A
No ratings yet
Lecture No. 08. Manual Techniques at Shoulder - A
29 pages
Final Exam - Math 111-Second Term 222
No ratings yet
Final Exam - Math 111-Second Term 222
7 pages
History of Sport - Wikipedia
No ratings yet
History of Sport - Wikipedia
19 pages
Accounting Information Systems 14th Edition (Ebook PDF) Download
100% (1)
Accounting Information Systems 14th Edition (Ebook PDF) Download
58 pages
Prediction of Compressive Strength of Concrete With Agricultural Waste and Natural Fibre 2024
No ratings yet
Prediction of Compressive Strength of Concrete With Agricultural Waste and Natural Fibre 2024
5 pages
اخلاق طبابت
No ratings yet
اخلاق طبابت
230 pages
Ejobscircular Com SSC Result 2020
No ratings yet
Ejobscircular Com SSC Result 2020
17 pages
Free API
No ratings yet
Free API
3 pages
Experiment No. 7: Numerical Aperture of The Optical Fiber
No ratings yet
Experiment No. 7: Numerical Aperture of The Optical Fiber
4 pages
GIVER Study Guide
No ratings yet
GIVER Study Guide
5 pages
Ysio
100% (1)
Ysio
252 pages
Theda Weberlucks Electroacoustic Voices in Vocal Performance Art A Gender Issue 1
No ratings yet
Theda Weberlucks Electroacoustic Voices in Vocal Performance Art A Gender Issue 1
10 pages
Manual Bomba Horizontal Clase D PDF
No ratings yet
Manual Bomba Horizontal Clase D PDF
24 pages
UT315A Software Installation Instruction
No ratings yet
UT315A Software Installation Instruction
4 pages
Lm3622 Aplication Circuit
No ratings yet
Lm3622 Aplication Circuit
2 pages
At Home and Abroad
No ratings yet
At Home and Abroad
6 pages
DLL Mapeh-5 Q2
No ratings yet
DLL Mapeh-5 Q2
99 pages
Amazonico London A La Carte Menu
No ratings yet
Amazonico London A La Carte Menu
2 pages
Mechatronics Q & A
No ratings yet
Mechatronics Q & A
3 pages
Service Catalog
No ratings yet
Service Catalog
3 pages
Morphology of Flowering Plants Learn Cbse
No ratings yet
Morphology of Flowering Plants Learn Cbse
6 pages