Null-text Inversion for Editing Real Images using Guided Diffusion Models
Ron Mokady*†1,2, Amir Hertz*†1,2, Kfir Aberman1, Yael Pritch1, and Daniel Cohen-Or†1,2
1 Google Research    2 The Blavatnik School of Computer Science, Tel Aviv University
[Figure 1 panels: a photo captioned “Zoom photo of flowers.” edited to “...origami flowers...”, flowers → cupcakes, “...wither flowers...”, and photo → sketch; a photo captioned “A cat sitting next to a mirror.” edited to “...silver cat sculpture...”, cat → tiger, “...sleeping cat...”, and “Watercolor drawing of...”.]
Figure 1. Null-text inversion for real image editing. Our method takes as input a real image (leftmost column) and an associated caption. The image is inverted with a DDIM diffusion model to yield a diffusion trajectory (second column from the left). Once inverted, we use the initial trajectory as a pivot for null-text optimization that accurately reconstructs the input image (third column from the left). Then, we can edit the inverted image by modifying only the input caption, using the editing technique of Prompt-to-Prompt [16].
Abstract

Recent text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is given to enable the modification of these images using text only, as a means to offer intuitive and versatile editing. To edit a real image using these state-of-the-art tools, one must first invert the image with a meaningful text prompt into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image. Our proposed inversion consists of two novel key components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestamp and optimize around it. We demonstrate that a direct inversion is inadequate on its own, but does provide a good anchor for our optimization. (ii) Null-text optimization, where we only modify the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This allows for keeping both the model weights and the conditional embedding intact, and hence enables applying prompt-based editing while avoiding the cumbersome tuning of the model's weights. Our null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt edits, showing high-fidelity editing of real images.

1. Introduction

The progress in image synthesis using text-guided diffusion models has attracted much attention due to their exceptional realism and diversity. Large-scale models [27, 30, 32] have ignited the imagination of multitudes of users, enabling image generation with unprecedented creative freedom. Naturally, this has initiated ongoing research efforts, investigating how to harness these powerful models for image editing. Most recently, intuitive text-based editing was demonstrated over synthesized images, allowing the user to easily manipulate an image using text only [16].

However, text-guided editing of a real image with these state-of-the-art tools requires inverting the given image and textual prompt. That is, finding an initial noise vector that produces the input image when fed with the prompt into the diffusion process, while preserving the editing capabilities of the model. The inversion process has recently drawn considerable attention for GANs [7, 41], but has not yet been fully addressed for text-guided diffusion models. Although an effective DDIM inversion [13, 35] scheme was suggested for unconditional diffusion models, it is found lacking for text-guided diffusion models when classifier-free guidance [18], which is necessary for meaningful editing, is applied.

In this paper, we introduce an effective inversion scheme, achieving near-perfect reconstruction while retaining the rich text-guided editing capabilities of the original model (see Fig. 1). Our approach is built upon the analysis of two key aspects of guided diffusion models: classifier-free guidance and DDIM inversion.

* Equal contribution.
† Performed this work while working at Google.
[Figure 2 panels: a real image captioned “A baby wearing a blue shirt lying on the sofa.” edited via caption changes such as “...blond baby...”, “...golden shirt...”, “...floral shirt...”, “...sleeping baby...”, baby → robot, sofa → grass, and sofa → ball pit; a second image captioned “A man in glasses eating a doughnut in the park.” edited with “...red-haired man...”, glasses → sunglasses, and “...angry man...”.]

In the widely used classifier-free guidance, in each diffusion step the prediction is performed twice: once unconditionally and once with the text condition. These predictions are then extrapolated to amplify the effect of the text guidance. While all works concentrate on the conditional prediction, we recognize the substantial effect induced by the unconditional part. Hence, we optimize the embedding used in the unconditional part in order to invert the input image and prompt. We refer to it as null-text optimization, as we replace the embedding of the empty text string with our optimized embedding.

DDIM inversion consists of performing DDIM sampling in reverse order. Although a slight error is introduced in each step, this works well in the unconditional case. However, in practice, it breaks for text-guided synthesis, since classifier-free guidance magnifies its accumulated error. We observe that it can still offer a promising starting point for the inversion. Inspired by the GAN literature, we use the sequence of noised latent codes, obtained from an initial
Bar-Tal et al. [6] propose a text-based localized editing technique without using any mask. Their technique allows high-quality texture editing, but not the modification of complex structures, since only CLIP [26] is employed as guidance instead of a generative diffusion model.

Hertz et al. [16] suggest an intuitive editing technique, called Prompt-to-Prompt, for manipulating local or global details by modifying only the text prompt when using text-guided diffusion models. By injecting internal cross-attention maps, they preserve the spatial layout and geometry, which enables the regeneration of an image while modifying it through prompt editing. Still, without an inversion technique, their approach is limited to synthesized images. Sheynin et al. [33] suggest training the model for local editing without the inversion requirement, but their expressiveness and quality are inferior compared to current large-scale diffusion models. Concurrent to our work, DiffEdit [9] uses DDIM inversion for image editing, but avoids the emerging distortion by automatically producing a mask that allows background preservation.

Also concurrent, Imagic [19] and UniTune [39] have demonstrated impressive editing results using the powerful Imagen model [32]. Yet, they both require the restrictive fine-tuning of the model. Moreover, Imagic requires a new tuning for each edit, while UniTune involves a parameter search for each image. Our method enables us to apply the text-only intuitive editing of Prompt-to-Prompt [16] on real images. We do not require any fine-tuning and provide high-quality local and global modifications using the publicly available Stable Diffusion [30] model.

Figure 3. Null-text inversion overview. Top: pivotal inversion. We first apply an initial DDIM inversion on the input image, which estimates a diffusion trajectory {z*_t}_{t=0}^T. Starting the diffusion process from the last latent z*_T results in an unsatisfying reconstruction, as the latent codes become farther away from the original trajectory. We use the initial trajectory as a pivot for our optimization, which brings the diffusion backward trajectory {z̄_t}_{t=1}^T closer to the original image encoding z*_0. Bottom: null-text optimization for timestamp t. Recall that classifier-free guidance consists of performing the prediction ε_θ twice: once using the text condition embedding and once unconditionally using the null-text embedding ∅ (bottom-left). Then, these are extrapolated with guidance scale w (middle). We optimize only the unconditional embeddings ∅_t by employing a reconstruction MSE loss (in red) between the predicted latent code z_{t-1} and the pivot z*_{t-1}.

3. Method

Let I be a real image. Our goal is to edit I, using only text guidance, to obtain an edited image I*. We use the setting defined by Prompt-to-Prompt [16], where the editing is guided by a source prompt P and an edited prompt P*. This requires the user to provide a source prompt. Yet, we found that automatically producing the source prompt using an off-the-shelf captioning model [24] works well (see Sec. 4). For example, in Fig. 2, given an image and the source prompt “A baby wearing...”, we replace the baby with a robot by providing the edited prompt “A robot wearing...”.

Such editing operations first require inverting I to the model's output domain. Namely, the main challenge is faithfully reconstructing I by feeding the source prompt P to the model, while still retaining the intuitive text-based editing abilities.

Our approach is based on two main observations. First, DDIM inversion produces an unsatisfying reconstruction when classifier-free guidance is applied, but provides a good starting point for the optimization, enabling us to efficiently achieve a high-fidelity inversion. Second, optimizing the unconditional null embedding, which is used in classifier-free guidance, allows an accurate reconstruction while avoiding the tuning of the model and the conditional embedding, thereby preserving the desired editing capabilities.

Next, we provide a short background, followed by a detailed description of our approach in Sec. 3.2 and Sec. 3.3. A general overview is provided in Fig. 3.

3.1. Background and Preliminaries

Text-guided diffusion models aim to map a random noise vector z_T and a textual condition P to an output image z_0, which corresponds to the given conditioning prompt. In order to perform sequential denoising, the network ε_θ is trained to predict artificial noise, following the objective:

\min_\theta \; \mathbb{E}_{z_0,\, \varepsilon \sim \mathcal{N}(0, I),\, t \sim \mathrm{Uniform}(1, T)} \big\| \varepsilon - \varepsilon_\theta(z_t, t, C) \big\|_2^2 .   (1)

Note that C = ψ(P) is the embedding of the text condition, and z_t is a noised sample, where noise is added to the sampled data z_0 according to timestamp t. At inference, given a noise vector z_T, the noise is gradually removed by sequentially predicting it using our trained network for T steps.

Since we aim to accurately reconstruct a given real image, we employ the deterministic DDIM sampling [35]:

z_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, z_t + \sqrt{\alpha_{t-1}} \left( \sqrt{\frac{1}{\alpha_{t-1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right) \cdot \varepsilon_\theta(z_t, t, C).

For the definition of α_t and additional details, please refer to Appendix E. Diffusion models often operate in the image pixel space, where z_0 is a sample of a real image. In our case, we use the popular and publicly available Stable Diffusion model [30], where the diffusion forward process is applied on a latent image encoding z_0 = E(x_0) and an image decoder is employed at the end of the diffusion backward process, x_0 = D(z_0).
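For concreteness, the deterministic DDIM step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `eps_model` stands for any noise-prediction network ε_θ(z_t, t, C), and `alphas` is assumed to be a 1-D tensor holding the cumulative products α_t (with alphas[0] ≈ 1).

```python
import torch

def ddim_step(z_t, t, cond, eps_model, alphas):
    """One deterministic DDIM step z_t -> z_{t-1} (sigma_t = 0).

    alphas: 1-D tensor of cumulative products alpha_t (alphas[0] ~ 1).
    eps_model: callable predicting eps_theta(z_t, t, cond).
    """
    eps = eps_model(z_t, t, cond)
    a_t, a_prev = alphas[t], alphas[t - 1]
    # Predicted clean latent z_0 implied by the current noisy latent.
    z0_pred = (z_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
    # Deterministic update toward timestep t-1 (equivalent to the formula above).
    return torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
```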
[Figure 4 plot: PSNR as a function of optimization iterations for DDIM inversion, VQAE reconstruction, Textual inversion, Random pivot, Global null-text, Random caption, and Null-text (ours); the bottom row shows the corresponding reconstructions of the input image.]

Figure 4. Ablation study. Top: we compare the performance of our full algorithm (green line) to different variations, evaluating the reconstruction quality by measuring the PSNR score as a function of the number of optimization iterations and the running time in minutes. Bottom: we visually show the inversion results after 200 iterations of our full algorithm (on the right) compared to the other baselines. Results for all iterations are shown in Appendix B (Figs. 13 and 14).
Classifier-free guidance. One of the key challenges in text-guided generation is the amplification of the effect induced by the conditioned text. To this end, Ho et al. [18] have presented the classifier-free guidance technique, where the prediction is also performed unconditionally and is then extrapolated with the conditioned prediction. More formally, let ∅ = ψ("") be the embedding of a null text and let w be the guidance scale parameter; then the classifier-free guidance prediction is defined by

\tilde{\varepsilon}_\theta(z_t, t, C, \varnothing) = w \cdot \varepsilon_\theta(z_t, t, C) + (1 - w) \cdot \varepsilon_\theta(z_t, t, \varnothing).

E.g., w = 7.5 is the default parameter for Stable Diffusion.
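A minimal sketch of this extrapolation is given below; `eps_model`, `cond_emb`, and `null_emb` are assumed handles to the noise-prediction network and to the conditional and null-text embeddings, not names from any particular codebase.

```python
import torch

def cfg_noise_pred(z_t, t, cond_emb, null_emb, eps_model, w=7.5):
    """Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional (null-text) one with guidance scale w."""
    eps_cond = eps_model(z_t, t, cond_emb)    # eps_theta(z_t, t, C)
    eps_uncond = eps_model(z_t, t, null_emb)  # eps_theta(z_t, t, null text)
    return w * eps_cond + (1.0 - w) * eps_uncond
```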
DDIM inversion. A simple inversion technique was suggested for the DDIM sampling [13, 35], based on the assumption that the ODE process can be reversed in the limit of small steps:

z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_t}}\, z_t + \sqrt{\alpha_{t+1}} \left( \sqrt{\frac{1}{\alpha_{t+1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right) \cdot \varepsilon_\theta(z_t, t, C).

In other words, the diffusion process is performed in the reverse direction, that is z_0 → z_T instead of z_T → z_0, where z_0 is set to be the encoding of the given real image.

3.2. Pivotal Inversion

Recent inversion works use random noise vectors for each iteration of their optimization, aiming at mapping every noise vector to a single image. We observe that this is inefficient, as inference requires only a single noise vector. Instead, inspired by the GAN literature [29], we seek to perform a more "local" optimization, ideally using only a single noise vector. In particular, we aim to perform our optimization around a pivotal noise vector which is a good approximation and thus allows a more efficient inversion.

We start by studying the DDIM inversion. In practice, a slight error is incorporated in every step. For unconditional diffusion models, the accumulated error is negligible and the DDIM inversion succeeds. However, recall that meaningful editing using the Stable Diffusion model [30] requires applying classifier-free guidance with a large guidance scale w > 1. We observe that such a guidance scale amplifies the accumulated error. Therefore, performing the DDIM inversion procedure with classifier-free guidance results not only in visual artifacts, but the obtained noise vector might also be out of the Gaussian distribution. The latter decreases the editability, i.e., the ability to edit using the particular noise vector.

We do recognize that using DDIM inversion with guidance scale w = 1 provides a rough approximation of the original image which is highly editable but far from accurate. More specifically, the reversed DDIM produces a T-step trajectory between the image encoding z_0 and a Gaussian noise vector z*_T. Again, a large guidance scale is essential for editing. Hence, we focus on feeding z*_T to the diffusion process with classifier-free guidance (w > 1). This results in high editability but inaccurate reconstruction, since the intermediate latent codes deviate from the trajectory, as illustrated in Fig. 3. An analysis of different guidance scale values for the DDIM inversion is provided in Appendix B (Fig. 9).

Motivated by the high editability, we refer to this initial DDIM inversion with w = 1 as our pivot trajectory, and we perform our optimization around it with the standard guidance scale, w > 1. That is, our optimization maximizes the similarity to the original image while maintaining our ability to perform meaningful editing. In practice, we execute a separate optimization for each timestamp t in the order of the diffusion process, t = T → t = 1, with the objective of getting as close as possible to the initial trajectory z*_T, ..., z*_0:

\min \; \big\| z^{*}_{t-1} - z_{t-1} \big\|_2^2 ,   (2)

where z_{t-1} is the intermediate result of the optimization. Since our pivotal DDIM inversion provides a rather good starting point, this optimization is highly efficient compared to using random noise vectors, as demonstrated in Sec. 4. Note that for every t < T, the optimization should start from the endpoint of the previous step's (t + 1) optimization; otherwise, our optimized trajectory would not hold at inference. Therefore, after the optimization of step t, we compute the current noisy latent z̄_t, which is then used in the optimization of the next step to ensure our new trajectory ends near z_0 (see Eq. (3) for more details).
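The pivot trajectory can be sketched as a reverse-order DDIM pass with w = 1, i.e., using only the conditional prediction. The code below reuses the hypothetical naming of the earlier sketches and is only an illustration of the idea, not the released implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(z0, cond_emb, eps_model, alphas, T=50):
    """Run DDIM in reverse (z_0 -> z_T) with guidance scale w = 1,
    returning the pivot trajectory [z*_0, z*_1, ..., z*_T].

    alphas: 1-D tensor with T + 1 cumulative products (alphas[0] ~ 1).
    """
    trajectory = [z0]
    z = z0
    for t in range(T):                       # move from step t to step t + 1
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = eps_model(z, t, cond_emb)      # conditional prediction only (w = 1)
        z0_pred = (z - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
        z = torch.sqrt(a_next) * z0_pred + torch.sqrt(1.0 - a_next) * eps
        trajectory.append(z)
    return trajectory                        # trajectory[-1] is z*_T
```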
Figure 5. Fine-control editing using attention re-weighting. We can use attention re-weighting to further control the level of dryness over the field (input image with the modified caption “A girl sitting in a dry field.”) or to create a denser zebra pattern over the couch (modified caption “A living room with a zebra dense pattern couch and pillows”).

3.3. Null-text optimization

To successfully invert real images into the model's domain, recent works optimize the textual encoding [15], the network's weights [31, 39], or both [19]. Fine-tuning the model's weights for each image involves duplicating the entire model, which is highly inefficient in terms of memory consumption. Moreover, unless fine-tuning is applied for each and every edit, it necessarily hurts the learned prior of the model and therefore the semantics of the edits. Direct optimization of the textual embedding results in a non-interpretable representation, since the optimized tokens do not necessarily match pre-existing words. Therefore, an intuitive prompt-to-prompt edit becomes more challenging.

Instead, we exploit the key feature of classifier-free guidance: the result is highly affected by the unconditional prediction. Therefore, we replace the default null-text embedding with an optimized one, referred to as null-text optimization. Namely, for each input image, we optimize only the unconditional embedding ∅, initialized with the null-text embedding. The model and the conditional textual embedding are kept unchanged.

This results in a high-quality reconstruction while still allowing intuitive editing with Prompt-to-Prompt [16] by simply using the optimized unconditional embedding. Moreover, after a single inversion process, the same unconditional embedding can be used for multiple editing operations over the input image. Since null-text optimization is naturally less expressive than fine-tuning the entire model, it requires the more efficient pivotal inversion scheme.

We refer to optimizing a single unconditional embedding ∅ as global null-text optimization. During our experiments, as shown in Fig. 4, we have observed that optimizing a different "null embedding" ∅_t for each timestamp t significantly improves the reconstruction quality, and this is well suited for our pivotal inversion. And so, we use per-timestamp unconditional embeddings {∅_t}_{t=1}^T, and initialize ∅_t with the embedding of the previous step, ∅_{t+1}.

Putting the two components together, our full algorithm is presented in Algorithm 1. The DDIM inversion with w = 1 outputs a sequence of noisy latent codes z*_T, ..., z*_0, where z*_0 = z_0. We initialize z̄_T = z*_T and perform the following optimization with the default guidance scale w = 7.5 for the timestamps t = T, ..., 1, each for N iterations:

\min_{\varnothing_t} \; \big\| z^{*}_{t-1} - z_{t-1}(\bar{z}_t, \varnothing_t, C) \big\|_2^2 .   (3)

For simplicity, z_{t-1}(z̄_t, ∅_t, C) denotes applying a DDIM sampling step using z̄_t, the unconditional embedding ∅_t, and the conditional embedding C. At the end of each step, we update

\bar{z}_{t-1} = z_{t-1}(\bar{z}_t, \varnothing_t, C).

We find that early stopping reduces the time consumption, resulting in ∼1 minute using a single A100 GPU.

Finally, we can edit the real input image by using the noise z̄_T = z*_T and the optimized unconditional embeddings {∅_t}_{t=1}^T. Please refer to Appendix D for additional implementation details.
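Putting Eq. (3) and the update rule together, the procedure can be sketched as follows. The helpers `ddim_invert` and `cfg_noise_pred` are the hypothetical functions sketched above, and the choice of Adam as the optimizer is an assumption of this sketch (the paper only states a learning rate); it is not the official implementation.

```python
import torch
import torch.nn.functional as F

def null_text_inversion(z0, cond_emb, null_emb_init, eps_model, alphas,
                        T=50, N=10, w=7.5, lr=1e-2):
    """Optimize one unconditional embedding per timestamp around the pivot
    trajectory produced by DDIM inversion with w = 1 (Eq. 3)."""
    pivots = ddim_invert(z0, cond_emb, eps_model, alphas, T)  # [z*_0, ..., z*_T]
    z_bar = pivots[-1]                                        # z̄_T = z*_T
    null_embs = []
    null_emb = null_emb_init.clone()
    for t in range(T, 0, -1):                                 # t = T, ..., 1
        # Initialize the step-t embedding from the previous step's result.
        null_emb = null_emb.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=lr)             # optimizer choice is an assumption
        for _ in range(N):
            eps = cfg_noise_pred(z_bar, t, cond_emb, null_emb, eps_model, w)
            a_t, a_prev = alphas[t], alphas[t - 1]
            z0_pred = (z_bar - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
            z_prev = torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
            loss = F.mse_loss(z_prev, pivots[t - 1])          # Eq. (3): match the pivot z*_{t-1}
            opt.zero_grad()
            loss.backward()
            opt.step()
        null_embs.append(null_emb.detach())
        with torch.no_grad():                                 # update z̄_{t-1} with the optimized ∅_t
            eps = cfg_noise_pred(z_bar, t, cond_emb, null_emb, eps_model, w)
            a_t, a_prev = alphas[t], alphas[t - 1]
            z0_pred = (z_bar - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
            z_bar = torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
    return pivots[-1], null_embs                              # z*_T and {∅_t} for t = T, ..., 1
```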
4. Ablation Study

In this section, we validate the contribution of our main components, thoroughly analyzing the effectiveness of our design choices by conducting an ablation study. We focus on the fidelity to the input image, which is an essential evaluation for image editing. In Sec. 5 we demonstrate that our method performs high-quality and meaningful manipulations.

Experimental setting. Evaluation is provided in Fig. 4. We have used a subset of 100 image and caption pairs, randomly selected from the COCO [8] validation set. We then applied our approach on each image-caption pair using the default Stable Diffusion hyper-parameters for an increasing number of iterations per diffusion step, N =
[Comparison figure panels: Input, Text2LIVE, VQGAN+CLIP, SDEdit, and Ours, for examples such as “A baby holding her monkey → zebra doll.”, “Two crochet birds sitting on a branch.”, and “A basket with apples → kittens on a chair.”]

Algorithm 1: Null-text inversion
1 Input: A source prompt embedding C = ψ(P) and input image I.
2 Output: Noise vector z̄_T and optimized embeddings {∅_t}_{t=1}^T.

Table 1. User study results. Percentage of participants who preferred each method:
VQGAN+CLIP: 3.8%   Text2LIVE: 16.6%   SDEdit: 14.5%   Ours: 65.1%
details. This results in severe artifacts when fine details are involved, such as human faces. For instance, identity drifts in the top row, and the background is not well preserved in the 2nd row. Contrarily, our method successfully preserves the original details, while allowing a wide range of realistic and meaningful edits, from simple textures to replacing well-structured objects.

Fig. 7 presents a comparison to mask-based methods, showing that these struggle to preserve details that are found inside the masked region. This is due to the masking procedure, which removes important structural information, and therefore some capabilities are out of the inpainting reach.

A comparison to Imagic [19], which operates in a different setting, requiring model tuning for each editing operation, is provided in Appendix B (Fig. 17). We first employ the unofficial Imagic implementation for Stable Diffusion and present the results for different values of the interpolation parameter α = 0.6, 0.7, 0.8, 0.9. This parameter is used to interpolate between the target text embedding and the optimized one [19]. In addition, the Imagic authors applied their method using the Imagen model over the same images, using the parameters α = 0.93, 0.86, 1.08. As can be seen, Imagic produces highly meaningful editing, especially when the Imagen model is involved. However, Imagic struggles to preserve the original details, such as the identity of the baby (1st row) or the cups in the background (2nd row). Furthermore, we observe that Imagic is quite sensitive to the interpolation parameter α, as a high value reduces the fidelity to the image and a low value reduces the fidelity to the text guidance, while a single value cannot be applied to all examples. Lastly, Imagic takes a longer inference time, as shown in Appendix C (Tab. 2).

Quantitative Comparison. Since ground truth is not available for text-based editing of real images, quantitative evaluation remains an open challenge. Similar to [6, 16], we present a user study in Tab. 1. 50 participants rated a total of 48 images for each baseline. The participants were recruited using Prolific (prolific.co). We presented side-by-side images produced by VQGAN+CLIP, Text2LIVE, SDEdit, and our method (in random order). We focus on methods that share a similar setting to ours, i.e., no model tuning and no mask requirement. The participants were asked to choose the method that better applies the requested edit while preserving most of the original details. A print screen is provided in Appendix F (Fig. 18). As shown in Tab. 1, most participants favored our method.

A quantitative comparison to Imagic is presented in Appendix B (Fig. 11), using the unofficial Stable Diffusion implementation. According to these measures, our method achieves better scores for the LPIPS perceptual distance, indicating a better fidelity to the input image.

5.2. Evaluating an Additional Editing Technique

Most of the presented results consist of applying our method with the editing technique of Prompt-to-Prompt [16]. However, we demonstrate that our method is not confined to a specific editing approach by showing that it improves the results of the SDEdit [23] editing technique.

In Fig. 8 (top), we measure the fidelity to the original image using the LPIPS perceptual distance [43] (lower is better), and the fidelity to the target text using CLIP similarity [26] (higher is better), over 100 examples. We use different values of the SDEdit parameter t_0 (marked on the curve), i.e., we start the diffusion process from a different t = t_0 · T using a correspondingly noised input image. This parameter controls the trade-off between fidelity to the input image (low t_0) and alignment to the text (high t_0). We compare the standard SDEdit to first applying our inversion and then performing SDEdit while replacing the null-text embedding with our optimized embeddings. As shown, our inversion significantly improves the fidelity to the input image.

This is visually demonstrated in Fig. 8 (bottom). Since the parameter t_0 controls a reconstruction-editability trade-off, we have used a different parameter for each method (SDEdit with and without our inversion) such that both achieve the same CLIP score. As can be seen, when using our method, the true identity of the baby is well preserved.
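As a rough illustration of this combination, the sketch below noises the input latent to step t_0 · T and then denoises it with the target prompt while swapping in the optimized unconditional embeddings. The helper `cfg_noise_pred` is the hypothetical guidance function sketched in Sec. 3.1, and the exact way the starting latent is formed here is our reading of the text above, not the authors' code.

```python
import torch

@torch.no_grad()
def sdedit_with_null_text(z0, target_emb, null_embs, eps_model, alphas,
                          T=50, t0=0.6, w=7.5):
    """SDEdit-style editing that reuses optimized null-text embeddings.
    null_embs[t] is assumed to hold the embedding for diffusion step t."""
    t_start = int(t0 * T)
    # Forward-noise the input latent up to the starting step.
    noise = torch.randn_like(z0)
    z = torch.sqrt(alphas[t_start]) * z0 + torch.sqrt(1.0 - alphas[t_start]) * noise
    # Denoise back to step 0 with the target prompt and per-step null embeddings.
    for t in range(t_start, 0, -1):
        eps = cfg_noise_pred(z, t, target_emb, null_embs[t], eps_model, w)
        a_t, a_prev = alphas[t], alphas[t - 1]
        z0_pred = (z - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
        z = torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
    return z
```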
6. Limitations

While our method works well in most scenarios, it still faces some limitations. The most notable one is inference time. Our approach requires approximately one minute on a GPU for inverting a single image. Then, an unlimited number of editing operations can be made, each taking only ten seconds. This is not enough for real-time applications. Other limitations come from using Stable Diffusion [30] and Prompt-to-Prompt editing [16]. First, the VQ auto-encoder produces artifacts in some cases, especially when human faces are involved. We consider the optimization of the VQ decoder as out of scope here, since this is specific to Stable Diffusion and we aim for a general framework. Second, we observe that the generated attention maps of Stable Diffusion are less accurate compared to the attention maps of Imagen [32], i.e., words might not relate to the correct region, indicating inferior text-based editing capabilities. Lastly, complicated structure modifications are out of reach for Prompt-to-Prompt, such as changing a sitting dog to a standing one as in [19]. Our inversion approach is orthogonal to the specific model and editing techniques, and we believe that these will be improved in the near future.

7. Conclusions

We have presented an approach to invert real images with corresponding captions into the latent space of a text-guided diffusion model while maintaining its powerful editing capabilities. Our two-step approach first uses DDIM inversion to compute a sequence of noisy codes, which roughly approximates the original image (with the given caption), and then uses this sequence as a fixed pivot to optimize the input null-text embedding. This fine optimization compensates for the inevitable reconstruction error caused by the classifier-free guidance component. Once the image-caption pair is accurately embedded in the output domain of the model, Prompt-to-Prompt editing can be instantly applied at inference time. By introducing two new technical concepts to text-guided diffusion models, pivotal inversion and null-text optimization, we were able to bridge the gap between reconstruction and editability. Our approach offers a surprisingly simple and compact means to reconstruct an arbitrary image, avoiding computationally intensive model tuning. We believe that null-text inversion paves the way for real-world use cases of intuitive, text-based image editing.
[Figure 8 plot: LPIPS vs. CLIPScore for SDEdit and SDEdit + Ours at different values of t_0 (marked on the curves); the bottom rows show, from left to right, the input image, our inversion, SDEdit, Ours + SDEdit, and Ours + P2P for the edits “Macaroni → cake on a table.” and “A baby wearing a blue shirt lying on the sofa → beach.”]

Figure 8. Our method improves SDEdit results. Top: we evaluate SDEdit with and without applying null-text inversion. In each measure, a different SDEdit parameter is used, i.e., a different percentage of diffusion steps is applied over the noisy image (marked on the curve). We measure both fidelity to the original image (via LPIPS, lower is better) and fidelity to the target text (via CLIP, higher is better). Bottom, from left to right: input image, null-text inversion, SDEdit, applying SDEdit after null-text inversion, and applying Prompt-to-Prompt after null-text inversion. As can be seen, our inversion significantly improves the fidelity to the original image when applied before SDEdit.

8. Acknowledgments

We thank Yuval Alaluf, Rinon Gal, Aleksander Holynski, Bryan Eric Feldman, Shlomi Fruchter and David Salesin for their valuable inputs that helped improve this work, and Bahjat Kawar, Shiran Zada and Oran Lang for providing us with their support for the Imagic [19] comparison. We also thank Jay Tenenbaum for the help with writing the background.

References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432-4441, 2019.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296-8305, 2020.
[3] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H. Bermano. HyperStyle: StyleGAN inversion with hypernetworks for real image editing, 2021.
[4] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208-18218, 2022.
[6] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. arXiv preprint arXiv:2204.02491, 2022.
[7] Amit H Bermano, Rinon Gal, Yuval Alaluf, Ron Mokady, Yotam Nitzan, Omer Tov, Oren Patashnik, and Daniel Cohen-Or. State-of-the-art in the architecture, methods and applications of StyleGAN. In Computer Graphics Forum, volume 41, pages 591-611. Wiley Online Library, 2022.
[8] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[9] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
[10] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967-1974, 2018.
[11] Katherine Crowson. VQGAN + CLIP, 2021. https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN.
[12] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022.
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.
[14] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
[15] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[18] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[19] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Hui-Tang Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv, abs/2210.09276, 2022.
[20] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426-2435, 2022.
[21] Gihyun Kwon and Jong Chul Ye. CLIPstyler: Image style transfer with a single text condition. arXiv preprint arXiv:2112.00374, 2021.
[22] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
[23] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[24] Ron Mokady, Amir Hertz, and Amit H Bermano. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
[25] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[28] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: A StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.
[29] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 2022.
[30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[31] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[32] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[33] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. KNN-Diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
[34] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015.
[35] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[36] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[37] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[38] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. arXiv preprint arXiv:2102.02766, 2021.
[39] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. UniTune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
[40] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity GAN inversion for image attribute editing. arXiv, abs/2109.06590, 2021.
[41] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN inversion: A survey, 2021.
[42] Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with deep generative models, 2017.
[43] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.
[44] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597-613. Springer, 2016.
[Figure 9 plots: (a) mean log-likelihood of z_T (scaled by 10^4) and (b) reconstruction PSNR, both as functions of the guidance scale w = 1, ..., 8.]

Figure 9. Setting the guidance scale for DDIM. We evaluate the DDIM inversion with different values of the guidance scale. On the left, we measure the log-likelihood of the latent vector z_T with respect to a multivariate normal distribution. This estimates the editability, as z_T should ideally be distributed normally, and deviation from this distribution reduces our ability to edit the image. On the right, we measure the reconstruction quality using PSNR. As can be seen, using a small guidance scale, such as w = 1, results in better editability and reconstruction.

Table 2. Inference time comparison. We measure both inversion and editing time for different methods. SDEdit is faster than ours, as an inversion is not employed by default, but it fails to preserve the unedited parts. Our method is more efficient than the rest of the baselines, as it provides accurate reconstruction with a faster inversion time, while also allowing multiple editing operations after a single inversion.

Method         Inversion   Editing   Multiple edits
VQGAN + CLIP   —           ∼1m       No
Text2LIVE      —           ∼9m       No
SDEdit         —           10s       Yes
Imagic         ∼5m         10s       No
Ours           ∼1m         10s       Yes

Appendix

A. Societal Impact

Our work suggests a new editing technique for manipulating real images using state-of-the-art text-to-image diffusion models. This modification of real photos might be exploited by malicious parties to produce fake content in order to spread disinformation. This is a known problem, common to all image editing techniques. However, research in identifying and preventing malicious editing is already making significant progress. We believe our work would contribute to this line of work, since we provide an analysis of the inversion and editing procedures using text-to-image diffusion models.

B. Ablation Study

DDIM Inversion. To validate our selection of the guidance scale parameter w = 1 during the DDIM inversion (see Algorithm 1, line 3, in the main text), we conduct the DDIM inversion with different values of w from 1 to 8, using the same data as in Section 4. For each inversion, we measure the log-likelihood of the resulting latent image z*_T ∈ R^{64×64×4} under the standard multivariate normal distribution. Intuitively, to achieve high editability we would like to maximize this term, since during training z*_T is distributed normally. The mean log-likelihood as a function of w is plotted in Fig. 9a. In addition, we measure the reconstruction with respect to the ground-truth input image using the PSNR metric (a minimal sketch of these two measurements is given below). As can be seen in Fig. 9b, increasing the value of w results in a less editable latent vector z*_T and a poorer initial reconstruction for our optimization, and therefore we use w = 1.

tion in order to produce semantic attention maps for these (Fig. 12 bottom). For example, to edit the print on the shirt, the source caption should include a "shirt with a drawing" term or a similar one.

Null-text optimization without pivotal inversion. Optimizing the null-text embedding fails without the efficient pivotal inversion. This is demonstrated in Figs. 13 and 14, where the non-pivotal null-text optimization produces low-quality reconstructions (2nd row).

Textual inversion with a pivot. Fig. 15 illustrates performing textual inversion around a pivot, i.e., similar to our pivotal inversion but optimizing the conditioned embedding. This results in a reconstruction comparable to ours, as demonstrated in Fig. 15 (bottom), but the editability is reduced. Analyzing the attention maps (Fig. 15, top), we observe that these are less accurate than ours. For example, using our null-text optimization, the attention referring to "goats" is much more local, and the attention referring to "desert" is more accurate. Consequently, editing the "desert" results in artifacts over the goats (Fig. 15, bottom).

C. Additional Results

Additional editing results of our method are provided in Fig. 10 and additional comparisons are provided in Fig. 16.

Inference time comparison. As can be seen in Tab. 2, SDEdit is the fastest, since an inversion is not employed, but as a result it fails to preserve the details of the original image. Our method is more efficient than Text2LIVE [6], VQGAN+CLIP [12] and Imagic [19], as it provides an accurate reconstruction in ∼1 minute, while also allowing multiple editing operations after a single inversion.
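A minimal sketch of the two measurements used in the guidance-scale analysis above (the editability proxy of Fig. 9a and the PSNR of Fig. 9b); the function names are illustrative only.

```python
import math
import torch

def standard_normal_log_likelihood(z_T):
    """Log-likelihood of a latent (e.g., z*_T of shape 64x64x4) under the
    standard multivariate normal, used as the editability proxy."""
    d = z_T.numel()
    return float(-0.5 * (z_T ** 2).sum() - 0.5 * d * math.log(2.0 * math.pi))

def psnr(reconstruction, ground_truth, max_val=1.0):
    """PSNR between a reconstruction and the ground-truth image."""
    mse = torch.mean((reconstruction - ground_truth) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```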
[Figure 10 panels: additional editing results for inputs captioned “A living room with a couch and pillows”, “Close up of a giraffe eating a bucket”, “A basket with apples on a chair”, “A bicycle is parking on the side of the street”, and “Two birds sitting on a branch”, with edits such as red velvet / leather / unicorn couch, fish / avocado cake / Lego cake, apples → puppies / cookies, cardboard basket, street → beach / forest, and snowy street.]

[Figure 11 plot: fidelity comparison (LPIPS) of Imagic (unofficial, α = 0.6, 0.7, 0.8, 0.9), SDEdit, and SDEdit + Ours.]

our method achieves better preservation of the original details (lower LPIPS). This is also supported by the visual results in Fig. 17, as Imagic struggles to accurately retain the background. Furthermore, we observe that Imagic is quite sensitive to the interpolation parameter α, as a high value reduces the fidelity to the image and a low value reduces the fidelity to the text, while a single value cannot be applied to all examples. In addition, the authors of Imagic applied their method on the same three images, presented in Fig. 17, using α = 0.93, 0.86, 1.08. This results in much better quality; however, the background is still not preserved.

* https://github.com/omerbt/Text2LIVE
† https://github.com/nerdyrodent/VQGAN-CLIP
‡ https://github.com/ermongroup/SDEdit
§ https://github.com/ShivamShrirao/diffusers/tree/main/examples/imagic

D. Implementation Details

In all of our experiments, we employ Stable Diffusion [30] using a DDIM sampler with the default hyperparameters: number of diffusion steps T = 50 and guidance scale w = 7.5. Stable Diffusion utilizes a pre-trained CLIP network as the language model ψ. The null text is tokenized into a start token, an end token, and 75 non-text padding tokens. Notice that the padding tokens are also used in CLIP and in the diffusion model, since both models do not use masking.

All inversion results, except the ones in the ablation study, were obtained using N = 10 (see Algorithm 1 in the main paper) and a learning rate of 0.01. We have used an early-stopping parameter of 1e-5, such that the total inversion for an input image and caption took 40s-120s on a single A100 GPU. Namely, for each timestamp t, we stop the optimization when the loss function value reaches 1e-5.
Global null-text inversion. The algorithm for optimizing only a single null-text embedding ∅ for all timestamps is presented in Algorithm 2. In this case, since the optimization of ∅ in a single timestamp affects all other timestamps, we change the order of the iterations of Algorithm 1. That is, we perform N iterations, in each of which we optimize ∅ for all the diffusion timestamps by iterating over t. As shown in Section 4, the convergence of this optimization is much slower than our final method. More specifically, we found that only after 7500 optimization steps (about 30 minutes) does the global null-text inversion accurately reconstruct the input image.

Algorithm 2: Global null-text inversion
1  Input: A source prompt P and input image I.
2  Output: Noise vector z̄_T and an optimized embedding ∅.
3  Set guidance scale w = 1;
4  Compute the intermediate results z*_T, ..., z*_0 of the DDIM inversion for image I;
5  Set guidance scale w = 7.5;
6  Initialize ∅ ← ψ("");
7  for j = 0, ..., N - 1 do
8      Set z̄_T ← z*_T;
9      for t = T, T - 1, ..., 1 do
10         ∅ ← ∅ - η ∇_∅ ||z*_{t-1} - z_{t-1}(z̄_t, ∅, C)||²_2;
           Set z̄_{t-1} ← z_{t-1}(z̄_t, ∅, C);
11     end
12 end
13 Return z̄_T, ∅

E. Additional Background - Diffusion Models

Denoising Diffusion Probabilistic Models (DDPM) [17, 34] are generative latent variable models that aim to model a distribution p_θ(x_0) that approximates the data distribution q(x_0) and is easy to sample from. DDPMs model a "forward process" in the space of x_0 from data to noise. This process is called "forward" due to its procedure progressing from x_0 to x_T. Note that this process is a Markov chain starting from x_0, where we gradually add noise to the data to generate the latent variables x_1, ..., x_T ∈ X. The sequence of latent variables therefore follows q(x_1, ..., x_T | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1}), where a step in the forward process is defined as a Gaussian transition q(x_t | x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I), parameterized by a schedule β_0, ..., β_T ∈ (0, 1). When T is large enough, the last noise vector x_T nearly follows an isotropic Gaussian distribution.

An interesting property of the forward process is that one can express the latent variable x_t directly as the following linear combination of noise and x_0, without sampling intermediate latent vectors:

x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, w, \qquad w \sim \mathcal{N}(0, I),   (5)

where α_t := \prod_{i=1}^{t} (1 - β_i).

To sample from the distribution q(x_0), we define the dual "reverse process" p(x_{t-1} | x_t) from isotropic Gaussian noise x_T to data by sampling the posteriors q(x_{t-1} | x_t). Since the intractable reverse process q(x_{t-1} | x_t) depends on the unknown data distribution q(x_0), we approximate it with a parameterized Gaussian transition network p_θ(x_{t-1} | x_t) := \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)). The term µ_θ(x_t, t) can be replaced [17] by predicting the noise ε_θ(x_t, t) added to x_0 using Eq. (5). We use Bayes' theorem to approximate

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}}\, \varepsilon_\theta(x_t, t) \right).   (6)

Once we have a trained ε_θ(x_t, t), we can use the following sampling method:

x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).   (7)

We can control σ_t at each sampling stage, and in DDIMs [35] the sampling process can be made deterministic by setting σ_t = 0 in all the steps. The reverse process can finally be trained by solving the following optimization problem:

\min_\theta L(\theta) := \min_\theta \mathbb{E}_{x_0 \sim q(x_0),\, w \sim \mathcal{N}(0, I),\, t} \big\| w - \varepsilon_\theta(x_t, t) \big\|_2^2 ,

teaching the parameters θ to fit q(x_0) by maximizing a variational lower bound.

F. User-Study

An illustration of our user study is provided in Fig. 18.

G. Image Attribution

Girl in a field: https://unsplash.com/photos/1pCpWipo_jM
Birds on a branch: https://pixabay.com/photos/sparrows-birds-perched-sperlings-3434123/
Basket with apples: https://unsplash.com/photos/4Bj27zMqNSE
Bicycle: https://unsplash.com/photos/vZAk_n9Plfc
Child climbing: https://unsplash.com/photos/oLZViCDG-dk
Mountains: https://pixabay.com/photos/desert-mountains-sky-clouds-peru-4842264/
Giraffe: https://www.flickr.com/photos/tambako/30850708538/
Blue-haired woman in the forest: https://unsplash.com/photos/I3oRtzyBIFg
Dining table: https://cocodataset.org/#explore?id=360849
Elephants: https://cocodataset.org/#explore?id=345520
Man with a doughnut: https://cocodataset.org/#explore?id=360849
Cake on a table: https://cocodataset.org/#explore?id=413699
Piece of cake: https://cocodataset.org/#explore?id=133063
[Figure 12 panels: an input image captioned “A woman with a blue hair.” inverted and edited (“...smiling woman...”, “...sad woman...”, “...curly blue hair...”, “...green hair...”, woman → squirrel, woman → storm trooper); the same image captioned “A woman in the forest.” (“...forest at fall.”, “...forest at winter.”, forest → city / beach / water park / magic kingdom); and captioned “A woman wearing a shirt with a drawing.” (“...long sleeves shirt...”, “...turtle neck shirt...”, “...red shirt...”, “... drawing of kermit.”, “...of cookie monster.”, “...of inspector gadget.”), together with the corresponding cross-attention maps.]

Figure 12. Robustness to the input caption. We can invert an input image (top) using different input captions (first column). Naturally, the selection of the caption affects the editing abilities with Prompt-to-Prompt, as can be seen in the visualization of the cross-attention maps (bottom). Yet, our method is not particularly sensitive to the exact wording of the prompt.

Figure 13. Ablation study (input caption: “A black dinning room table sitting in a yellow dinning room.”). We show the inversion results for an increasing number of optimization iterations. Our method achieves high-quality reconstruction with fewer optimization steps.

Figure 14. Ablation study (input caption: “Two people riding elephants in dirty deep water.”). We show the inversion results for an increasing number of optimization iterations. Our method achieves high-quality reconstruction with fewer optimization steps.

[Figure 15: attention maps of text-embedding optimization + pivotal inversion, compared to ours.]

[Figure 16: additional comparisons of the input, our inversion, Text2LIVE, VQGAN+CLIP, SDEdit, and our editing.]

[Figure 17: comparison of the input, Imagic with Stable Diffusion (α = 0.6, 0.7, 0.8, 0.9), Imagic with Imagen, and ours.]

Figure 18. User study print screen.