Null-text Inversion for Editing Real Images using Guided Diffusion Models
Ron Mokady*†1,2, Amir Hertz*†1,2, Kfir Aberman1, Yael Pritch1, and Daniel Cohen-Or†1,2
1 Google Research    2 The Blavatnik School of Computer Science, Tel Aviv University
[Figure 1 panels: a photo captioned “Zoom photo of flowers.” edited to “...origami flowers...”, flowers → cupcakes, “...wither flowers...”, and photo → sketch; a photo captioned “A cat sitting next to a mirror.” edited to “...silver cat sculpture...”, cat → tiger, “...sleeping cat...”, and “Watercolor drawing of...”.]
Figure 1. Null-text inversion for real image editing. Our method takes as input a real image (leftmost column) and an associated caption. The image is inverted with a DDIM diffusion model to yield a diffusion trajectory (second column from the left). Once inverted, we use the initial trajectory as a pivot for null-text optimization that accurately reconstructs the input image (third column from the left). Then, we can edit the inverted image by modifying only the input caption, using the editing technique of Prompt-to-Prompt [16].
Abstract

Recent text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is given to enable the modification of these images using text only, as a means to offer intuitive and versatile editing. To edit a real image using these state-of-the-art tools, one must first invert the image with a meaningful text prompt into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image. Our proposed inversion consists of two novel key components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestamp and optimize around it. We demonstrate that a direct inversion is inadequate on its own, but does provide a good anchor for our optimization. (ii) Null-text optimization, where we only modify the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This allows for keeping both the model weights and the conditional embedding intact, and hence enables applying prompt-based editing while avoiding the cumbersome tuning of the model's weights. Our null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt edits, showing high-fidelity editing of real images.

1. Introduction

The progress in image synthesis using text-guided diffusion models has attracted much attention due to their exceptional realism and diversity. Large-scale models [27, 30, 32] have ignited the imagination of multitudes of users, enabling image generation with unprecedented creative freedom. Naturally, this has initiated ongoing research efforts, investigating how to harness these powerful models for image editing. Most recently, intuitive text-based editing was demonstrated over synthesized images, allowing the user to easily manipulate an image using text only [16].

However, text-guided editing of a real image with these state-of-the-art tools requires inverting the given image and textual prompt. That is, finding an initial noise vector that produces the input image when fed with the prompt into the diffusion process, while preserving the editing capabilities of the model. The inversion process has recently drawn considerable attention for GANs [7, 41], but has not yet been fully addressed for text-guided diffusion models. Although an effective DDIM inversion [13, 35] scheme was suggested for unconditional diffusion models, it is found lacking for text-guided diffusion models when classifier-free guidance [18], which is necessary for meaningful editing, is applied.

In this paper, we introduce an effective inversion scheme, achieving near-perfect reconstruction while retaining the rich text-guided editing capabilities of the original model (see Fig. 1). Our approach is built upon the analysis of two key aspects of guided diffusion models: classifier-free guidance and DDIM inversion.

* Equal contribution.
† Performed this work while working at Google.
[Figure 2 panels: a real image captioned “A baby wearing a blue shirt lying on the sofa.” edited via caption changes such as “...blond baby...”, “...golden shirt...”, “...floral shirt...”, “...sleeping baby...”, baby → robot, sofa → grass, and sofa → ball pit; a second image captioned “A man in glasses eating a doughnut in the park.” edited with “...red-haired man...”, glasses → sunglasses, and “...angry man...”.]

In the widely used classifier-free guidance, in each diffusion step the prediction is performed twice: once unconditionally and once with the text condition. These predictions are then extrapolated to amplify the effect of the text guidance. While all works concentrate on the conditional prediction, we recognize the substantial effect induced by the unconditional part. Hence, we optimize the embedding used in the unconditional part in order to invert the input image and prompt. We refer to it as null-text optimization, as we replace the embedding of the empty text string with our optimized embedding.

DDIM inversion consists of performing DDIM sampling in reverse order. Although a slight error is introduced in each step, this works well in the unconditional case. However, in practice, it breaks for text-guided synthesis, since classifier-free guidance magnifies its accumulated error. We observe that it can still offer a promising starting point for the inversion. Inspired by the GAN literature, we use the sequence of noised latent codes, obtained from an initial
Bar-Tal et al. [6] propose a text-based localized editing technique without using any mask. Their technique allows high-quality texture editing, but not the modification of complex structures, since only CLIP [26] is employed as guidance instead of a generative diffusion model.

Hertz et al. [16] suggest an intuitive editing technique, called Prompt-to-Prompt, for manipulating local or global details by modifying only the text prompt when using text-guided diffusion models. By injecting internal cross-attention maps, they preserve the spatial layout and geometry, which enables the regeneration of an image while modifying it through prompt editing. Still, without an inversion technique, their approach is limited to synthesized images. Sheynin et al. [33] suggest training the model for local editing without the inversion requirement, but their expressiveness and quality are inferior compared to current large-scale diffusion models. Concurrent to our work, DiffEdit [9] uses DDIM inversion for image editing, but avoids the emerging distortion by automatically producing a mask that allows background preservation.

Also concurrent, Imagic [19] and UniTune [39] have demonstrated impressive editing results using the powerful Imagen model [32]. Yet, they both require the restrictive fine-tuning of the model. Moreover, Imagic requires a new tuning for each edit, while UniTune involves a parameter search for each image. Our method enables us to apply the text-only intuitive editing of Prompt-to-Prompt [16] on real images. We do not require any fine-tuning and provide high-quality local and global modifications using the publicly available Stable Diffusion [30] model.

Figure 3. Null-text inversion overview. Top: pivotal inversion. We first apply an initial DDIM inversion on the input image, which estimates a diffusion trajectory {z*_t}_{t=0}^T. Starting the diffusion process from the last latent z*_T results in an unsatisfying reconstruction, as the latent codes become farther away from the original trajectory. We use the initial trajectory as a pivot for our optimization, which brings the diffusion backward trajectory {z̄_t}_{t=1}^T closer to the original image encoding z*_0. Bottom: null-text optimization for timestamp t. Recall that classifier-free guidance consists of performing the prediction ε_θ twice: once using the text condition embedding and once unconditionally using the null-text embedding ∅ (bottom-left). Then, these are extrapolated with guidance scale w (middle). We optimize only the unconditional embeddings ∅_t by employing a reconstruction MSE loss (in red) between the predicted latent code z_{t-1} and the pivot z*_{t-1}.

3. Method

Let I be a real image. Our goal is to edit I, using only text guidance, to obtain an edited image I*. We use the setting defined by Prompt-to-Prompt [16], where the editing is guided by a source prompt P and an edited prompt P*. This requires the user to provide a source prompt. Yet, we found that automatically producing the source prompt using an off-the-shelf captioning model [24] works well (see Sec. 4). For example, in Fig. 2, given an image and the source prompt “A baby wearing...”, we replace the baby with a robot by providing the edited prompt “A robot wearing...”.

Such editing operations first require inverting I to the model's output domain. Namely, the main challenge is faithfully reconstructing I by feeding the source prompt P to the model, while still retaining the intuitive text-based editing abilities.

Our approach is based on two main observations. First, DDIM inversion produces an unsatisfying reconstruction when classifier-free guidance is applied, but provides a good starting point for the optimization, enabling us to efficiently achieve a high-fidelity inversion. Second, optimizing the unconditional null embedding, which is used in classifier-free guidance, allows an accurate reconstruction while avoiding the tuning of the model and the conditional embedding, thereby preserving the desired editing capabilities.

Next, we provide a short background, followed by a detailed description of our approach in Sec. 3.2 and Sec. 3.3. A general overview is provided in Fig. 3.

3.1. Background and Preliminaries

Text-guided diffusion models aim to map a random noise vector z_T and a textual condition P to an output image z_0, which corresponds to the given conditioning prompt. In order to perform sequential denoising, the network ε_θ is trained to predict artificial noise, following the objective:

\min_\theta \; \mathbb{E}_{z_0,\, \varepsilon \sim \mathcal{N}(0, I),\, t \sim \mathrm{Uniform}(1, T)} \big\| \varepsilon - \varepsilon_\theta(z_t, t, C) \big\|_2^2 .   (1)

Note that C = ψ(P) is the embedding of the text condition, and z_t is a noised sample, where noise is added to the sampled data z_0 according to timestamp t. At inference, given a noise vector z_T, the noise is gradually removed by sequentially predicting it using our trained network for T steps.

Since we aim to accurately reconstruct a given real image, we employ the deterministic DDIM sampling [35]:

z_{t-1} = \sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, z_t + \sqrt{\alpha_{t-1}} \left( \sqrt{\frac{1}{\alpha_{t-1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right) \cdot \varepsilon_\theta(z_t, t, C).

For the definition of α_t and additional details, please refer to Appendix E. Diffusion models often operate in the image pixel space, where z_0 is a sample of a real image. In our case, we use the popular and publicly available Stable Diffusion model [30], where the diffusion forward process is applied on a latent image encoding z_0 = E(x_0) and an image decoder is employed at the end of the diffusion backward process, x_0 = D(z_0).
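For concreteness, the deterministic DDIM step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `eps_model` stands for any noise-prediction network ε_θ(z_t, t, C), and `alphas` is assumed to be a 1-D tensor holding the cumulative products α_t (with alphas[0] ≈ 1).

```python
import torch

def ddim_step(z_t, t, cond, eps_model, alphas):
    """One deterministic DDIM step z_t -> z_{t-1} (sigma_t = 0).

    alphas: 1-D tensor of cumulative products alpha_t (alphas[0] ~ 1).
    eps_model: callable predicting eps_theta(z_t, t, cond).
    """
    eps = eps_model(z_t, t, cond)
    a_t, a_prev = alphas[t], alphas[t - 1]
    # Predicted clean latent z_0 implied by the current noisy latent.
    z0_pred = (z_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
    # Deterministic update toward timestep t-1 (equivalent to the formula above).
    return torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
```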
[Figure 4 plot: PSNR as a function of optimization iterations for DDIM inversion, VQAE reconstruction, Textual inversion, Random pivot, Global null-text, Random caption, and Null-text (ours); the bottom row shows the corresponding reconstructions of the input image.]

Figure 4. Ablation study. Top: we compare the performance of our full algorithm (green line) to different variations, evaluating the reconstruction quality by measuring the PSNR score as a function of the number of optimization iterations and the running time in minutes. Bottom: we visually show the inversion results after 200 iterations of our full algorithm (on the right) compared to the other baselines. Results for all iterations are shown in Appendix B (Figs. 13 and 14).
Classifier-free guidance. One of the key challenges in text-guided generation is the amplification of the effect induced by the conditioned text. To this end, Ho et al. [18] have presented the classifier-free guidance technique, where the prediction is also performed unconditionally and is then extrapolated with the conditioned prediction. More formally, let ∅ = ψ("") be the embedding of a null text and let w be the guidance scale parameter; then the classifier-free guidance prediction is defined by

\tilde{\varepsilon}_\theta(z_t, t, C, \varnothing) = w \cdot \varepsilon_\theta(z_t, t, C) + (1 - w) \cdot \varepsilon_\theta(z_t, t, \varnothing).

E.g., w = 7.5 is the default parameter for Stable Diffusion.
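A minimal sketch of this extrapolation is given below; `eps_model`, `cond_emb`, and `null_emb` are assumed handles to the noise-prediction network and to the conditional and null-text embeddings, not names from any particular codebase.

```python
import torch

def cfg_noise_pred(z_t, t, cond_emb, null_emb, eps_model, w=7.5):
    """Classifier-free guidance: extrapolate the conditional prediction
    away from the unconditional (null-text) one with guidance scale w."""
    eps_cond = eps_model(z_t, t, cond_emb)    # eps_theta(z_t, t, C)
    eps_uncond = eps_model(z_t, t, null_emb)  # eps_theta(z_t, t, null text)
    return w * eps_cond + (1.0 - w) * eps_uncond
```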
DDIM inversion. A simple inversion technique was suggested for the DDIM sampling [13, 35], based on the assumption that the ODE process can be reversed in the limit of small steps:

z_{t+1} = \sqrt{\frac{\alpha_{t+1}}{\alpha_t}}\, z_t + \sqrt{\alpha_{t+1}} \left( \sqrt{\frac{1}{\alpha_{t+1}} - 1} - \sqrt{\frac{1}{\alpha_t} - 1} \right) \cdot \varepsilon_\theta(z_t, t, C).

In other words, the diffusion process is performed in the reverse direction, that is z_0 → z_T instead of z_T → z_0, where z_0 is set to be the encoding of the given real image.

3.2. Pivotal Inversion

Recent inversion works use random noise vectors for each iteration of their optimization, aiming at mapping every noise vector to a single image. We observe that this is inefficient, as inference requires only a single noise vector. Instead, inspired by the GAN literature [29], we seek to perform a more "local" optimization, ideally using only a single noise vector. In particular, we aim to perform our optimization around a pivotal noise vector which is a good approximation and thus allows a more efficient inversion.

We start by studying the DDIM inversion. In practice, a slight error is incorporated in every step. For unconditional diffusion models, the accumulated error is negligible and the DDIM inversion succeeds. However, recall that meaningful editing using the Stable Diffusion model [30] requires applying classifier-free guidance with a large guidance scale w > 1. We observe that such a guidance scale amplifies the accumulated error. Therefore, performing the DDIM inversion procedure with classifier-free guidance results not only in visual artifacts, but the obtained noise vector might also be out of the Gaussian distribution. The latter decreases the editability, i.e., the ability to edit using the particular noise vector.

We do recognize that using DDIM inversion with guidance scale w = 1 provides a rough approximation of the original image which is highly editable but far from accurate. More specifically, the reversed DDIM produces a T-step trajectory between the image encoding z_0 and a Gaussian noise vector z*_T. Again, a large guidance scale is essential for editing. Hence, we focus on feeding z*_T to the diffusion process with classifier-free guidance (w > 1). This results in high editability but inaccurate reconstruction, since the intermediate latent codes deviate from the trajectory, as illustrated in Fig. 3. An analysis of different guidance scale values for the DDIM inversion is provided in Appendix B (Fig. 9).

Motivated by the high editability, we refer to this initial DDIM inversion with w = 1 as our pivot trajectory, and we perform our optimization around it with the standard guidance scale, w > 1. That is, our optimization maximizes the similarity to the original image while maintaining our ability to perform meaningful editing. In practice, we execute a separate optimization for each timestamp t in the order of the diffusion process, t = T → t = 1, with the objective of getting as close as possible to the initial trajectory z*_T, ..., z*_0:

\min \; \big\| z^{*}_{t-1} - z_{t-1} \big\|_2^2 ,   (2)

where z_{t-1} is the intermediate result of the optimization. Since our pivotal DDIM inversion provides a rather good starting point, this optimization is highly efficient compared to using random noise vectors, as demonstrated in Sec. 4. Note that for every t < T, the optimization should start from the endpoint of the previous step's (t + 1) optimization; otherwise, our optimized trajectory would not hold at inference. Therefore, after the optimization of step t, we compute the current noisy latent z̄_t, which is then used in the optimization of the next step to ensure our new trajectory ends near z_0 (see Eq. (3) for more details).
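The pivot trajectory can be sketched as a reverse-order DDIM pass with w = 1, i.e., using only the conditional prediction. The code below reuses the hypothetical naming of the earlier sketches and is only an illustration of the idea, not the released implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(z0, cond_emb, eps_model, alphas, T=50):
    """Run DDIM in reverse (z_0 -> z_T) with guidance scale w = 1,
    returning the pivot trajectory [z*_0, z*_1, ..., z*_T].

    alphas: 1-D tensor with T + 1 cumulative products (alphas[0] ~ 1).
    """
    trajectory = [z0]
    z = z0
    for t in range(T):                       # move from step t to step t + 1
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = eps_model(z, t, cond_emb)      # conditional prediction only (w = 1)
        z0_pred = (z - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
        z = torch.sqrt(a_next) * z0_pred + torch.sqrt(1.0 - a_next) * eps
        trajectory.append(z)
    return trajectory                        # trajectory[-1] is z*_T
```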
Figure 5. Fine-control editing using attention re-weighting. We can use attention re-weighting to further control the level of dryness over the field (input image with the modified caption “A girl sitting in a dry field.”) or to create a denser zebra pattern over the couch (modified caption “A living room with a zebra dense pattern couch and pillows”).

3.3. Null-text optimization

To successfully invert real images into the model's domain, recent works optimize the textual encoding [15], the network's weights [31, 39], or both [19]. Fine-tuning the model's weights for each image involves duplicating the entire model, which is highly inefficient in terms of memory consumption. Moreover, unless fine-tuning is applied for each and every edit, it necessarily hurts the learned prior of the model and therefore the semantics of the edits. Direct optimization of the textual embedding results in a non-interpretable representation, since the optimized tokens do not necessarily match pre-existing words. Therefore, an intuitive prompt-to-prompt edit becomes more challenging.

Instead, we exploit the key feature of classifier-free guidance: the result is highly affected by the unconditional prediction. Therefore, we replace the default null-text embedding with an optimized one, referred to as null-text optimization. Namely, for each input image, we optimize only the unconditional embedding ∅, initialized with the null-text embedding. The model and the conditional textual embedding are kept unchanged.

This results in a high-quality reconstruction while still allowing intuitive editing with Prompt-to-Prompt [16] by simply using the optimized unconditional embedding. Moreover, after a single inversion process, the same unconditional embedding can be used for multiple editing operations over the input image. Since null-text optimization is naturally less expressive than fine-tuning the entire model, it requires the more efficient pivotal inversion scheme.

We refer to optimizing a single unconditional embedding ∅ as global null-text optimization. During our experiments, as shown in Fig. 4, we have observed that optimizing a different "null embedding" ∅_t for each timestamp t significantly improves the reconstruction quality, and this is well suited for our pivotal inversion. And so, we use per-timestamp unconditional embeddings {∅_t}_{t=1}^T, and initialize ∅_t with the embedding of the previous step, ∅_{t+1}.

Putting the two components together, our full algorithm is presented in Algorithm 1. The DDIM inversion with w = 1 outputs a sequence of noisy latent codes z*_T, ..., z*_0, where z*_0 = z_0. We initialize z̄_T = z*_T and perform the following optimization with the default guidance scale w = 7.5 for the timestamps t = T, ..., 1, each for N iterations:

\min_{\varnothing_t} \; \big\| z^{*}_{t-1} - z_{t-1}(\bar{z}_t, \varnothing_t, C) \big\|_2^2 .   (3)

For simplicity, z_{t-1}(z̄_t, ∅_t, C) denotes applying a DDIM sampling step using z̄_t, the unconditional embedding ∅_t, and the conditional embedding C. At the end of each step, we update

\bar{z}_{t-1} = z_{t-1}(\bar{z}_t, \varnothing_t, C).

We find that early stopping reduces the time consumption, resulting in ∼1 minute using a single A100 GPU.

Finally, we can edit the real input image by using the noise z̄_T = z*_T and the optimized unconditional embeddings {∅_t}_{t=1}^T. Please refer to Appendix D for additional implementation details.
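Putting Eq. (3) and the update rule together, the procedure can be sketched as follows. The helpers `ddim_invert` and `cfg_noise_pred` are the hypothetical functions sketched above, and the choice of Adam as the optimizer is an assumption of this sketch (the paper only states a learning rate); it is not the official implementation.

```python
import torch
import torch.nn.functional as F

def null_text_inversion(z0, cond_emb, null_emb_init, eps_model, alphas,
                        T=50, N=10, w=7.5, lr=1e-2):
    """Optimize one unconditional embedding per timestamp around the pivot
    trajectory produced by DDIM inversion with w = 1 (Eq. 3)."""
    pivots = ddim_invert(z0, cond_emb, eps_model, alphas, T)  # [z*_0, ..., z*_T]
    z_bar = pivots[-1]                                        # z̄_T = z*_T
    null_embs = []
    null_emb = null_emb_init.clone()
    for t in range(T, 0, -1):                                 # t = T, ..., 1
        # Initialize the step-t embedding from the previous step's result.
        null_emb = null_emb.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=lr)             # optimizer choice is an assumption
        for _ in range(N):
            eps = cfg_noise_pred(z_bar, t, cond_emb, null_emb, eps_model, w)
            a_t, a_prev = alphas[t], alphas[t - 1]
            z0_pred = (z_bar - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
            z_prev = torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
            loss = F.mse_loss(z_prev, pivots[t - 1])          # Eq. (3): match the pivot z*_{t-1}
            opt.zero_grad()
            loss.backward()
            opt.step()
        null_embs.append(null_emb.detach())
        with torch.no_grad():                                 # update z̄_{t-1} with the optimized ∅_t
            eps = cfg_noise_pred(z_bar, t, cond_emb, null_emb, eps_model, w)
            a_t, a_prev = alphas[t], alphas[t - 1]
            z0_pred = (z_bar - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
            z_bar = torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
    return pivots[-1], null_embs                              # z*_T and {∅_t} for t = T, ..., 1
```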
4. Ablation Study

In this section, we validate the contribution of our main components, thoroughly analyzing the effectiveness of our design choices by conducting an ablation study. We focus on the fidelity to the input image, which is an essential evaluation for image editing. In Sec. 5 we demonstrate that our method performs high-quality and meaningful manipulations.

Experimental setting. Evaluation is provided in Fig. 4. We have used a subset of 100 image and caption pairs, randomly selected from the COCO [8] validation set. We then applied our approach on each image-caption pair using the default Stable Diffusion hyper-parameters for an increasing number of iterations per diffusion step, N =
[Comparison figure panels: Input, Text2LIVE, VQGAN+CLIP, SDEdit, and Ours, for examples such as “A baby holding her monkey → zebra doll.”, “Two crochet birds sitting on a branch.”, and “A basket with apples → kittens on a chair.”]

Algorithm 1: Null-text inversion
1 Input: A source prompt embedding C = ψ(P) and input image I.
2 Output: Noise vector z̄_T and optimized embeddings {∅_t}_{t=1}^T.

Table 1. User study results. Percentage of participants who preferred each method:
VQGAN+CLIP: 3.8%   Text2LIVE: 16.6%   SDEdit: 14.5%   Ours: 65.1%
details. This results in severe artifacts when fine details are involved, such as human faces. For instance, identity drifts in the top row, and the background is not well preserved in the 2nd row. Contrarily, our method successfully preserves the original details, while allowing a wide range of realistic and meaningful edits, from simple textures to replacing well-structured objects.

Fig. 7 presents a comparison to mask-based methods, showing that these struggle to preserve details that are found inside the masked region. This is due to the masking procedure, which removes important structural information, and therefore some capabilities are out of the inpainting reach.

A comparison to Imagic [19], which operates in a different setting, requiring model tuning for each editing operation, is provided in Appendix B (Fig. 17). We first employ the unofficial Imagic implementation for Stable Diffusion and present the results for different values of the interpolation parameter α = 0.6, 0.7, 0.8, 0.9. This parameter is used to interpolate between the target text embedding and the optimized one [19]. In addition, the Imagic authors applied their method using the Imagen model over the same images, using the parameters α = 0.93, 0.86, 1.08. As can be seen, Imagic produces highly meaningful editing, especially when the Imagen model is involved. However, Imagic struggles to preserve the original details, such as the identity of the baby (1st row) or the cups in the background (2nd row). Furthermore, we observe that Imagic is quite sensitive to the interpolation parameter α, as a high value reduces the fidelity to the image and a low value reduces the fidelity to the text guidance, while a single value cannot be applied to all examples. Lastly, Imagic takes a longer inference time, as shown in Appendix C (Tab. 2).

Quantitative Comparison. Since ground truth is not available for text-based editing of real images, quantitative evaluation remains an open challenge. Similar to [6, 16], we present a user study in Tab. 1. 50 participants rated a total of 48 images for each baseline. The participants were recruited using Prolific (prolific.co). We presented side-by-side images produced by VQGAN+CLIP, Text2LIVE, SDEdit, and our method (in random order). We focus on methods that share a similar setting to ours, i.e., no model tuning and no mask requirement. The participants were asked to choose the method that better applies the requested edit while preserving most of the original details. A print screen is provided in Appendix F (Fig. 18). As shown in Tab. 1, most participants favored our method.

A quantitative comparison to Imagic is presented in Appendix B (Fig. 11), using the unofficial Stable Diffusion implementation. According to these measures, our method achieves better scores for the LPIPS perceptual distance, indicating a better fidelity to the input image.

5.2. Evaluating an Additional Editing Technique

Most of the presented results consist of applying our method with the editing technique of Prompt-to-Prompt [16]. However, we demonstrate that our method is not confined to a specific editing approach by showing that it improves the results of the SDEdit [23] editing technique.

In Fig. 8 (top), we measure the fidelity to the original image using the LPIPS perceptual distance [43] (lower is better), and the fidelity to the target text using CLIP similarity [26] (higher is better), over 100 examples. We use different values of the SDEdit parameter t_0 (marked on the curve), i.e., we start the diffusion process from a different t = t_0 · T using a correspondingly noised input image. This parameter controls the trade-off between fidelity to the input image (low t_0) and alignment to the text (high t_0). We compare the standard SDEdit to first applying our inversion and then performing SDEdit while replacing the null-text embedding with our optimized embeddings. As shown, our inversion significantly improves the fidelity to the input image.

This is visually demonstrated in Fig. 8 (bottom). Since the parameter t_0 controls a reconstruction-editability trade-off, we have used a different parameter for each method (SDEdit with and without our inversion) such that both achieve the same CLIP score. As can be seen, when using our method, the true identity of the baby is well preserved.
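As a rough illustration of this combination, the sketch below noises the input latent to step t_0 · T and then denoises it with the target prompt while swapping in the optimized unconditional embeddings. The helper `cfg_noise_pred` is the hypothetical guidance function sketched in Sec. 3.1, and the exact way the starting latent is formed here is our reading of the text above, not the authors' code.

```python
import torch

@torch.no_grad()
def sdedit_with_null_text(z0, target_emb, null_embs, eps_model, alphas,
                          T=50, t0=0.6, w=7.5):
    """SDEdit-style editing that reuses optimized null-text embeddings.
    null_embs[t] is assumed to hold the embedding for diffusion step t."""
    t_start = int(t0 * T)
    # Forward-noise the input latent up to the starting step.
    noise = torch.randn_like(z0)
    z = torch.sqrt(alphas[t_start]) * z0 + torch.sqrt(1.0 - alphas[t_start]) * noise
    # Denoise back to step 0 with the target prompt and per-step null embeddings.
    for t in range(t_start, 0, -1):
        eps = cfg_noise_pred(z, t, target_emb, null_embs[t], eps_model, w)
        a_t, a_prev = alphas[t], alphas[t - 1]
        z0_pred = (z - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
        z = torch.sqrt(a_prev) * z0_pred + torch.sqrt(1.0 - a_prev) * eps
    return z
```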
6. Limitations

While our method works well in most scenarios, it still faces some limitations. The most notable one is inference time. Our approach requires approximately one minute on a GPU for inverting a single image. Then, an unlimited number of editing operations can be made, each taking only ten seconds. This is not enough for real-time applications. Other limitations come from using Stable Diffusion [30] and Prompt-to-Prompt editing [16]. First, the VQ auto-encoder produces artifacts in some cases, especially when human faces are involved. We consider the optimization of the VQ decoder as out of scope here, since this is specific to Stable Diffusion and we aim for a general framework. Second, we observe that the generated attention maps of Stable Diffusion are less accurate compared to the attention maps of Imagen [32], i.e., words might not relate to the correct region, indicating inferior text-based editing capabilities. Lastly, complicated structure modifications are out of reach for Prompt-to-Prompt, such as changing a sitting dog to a standing one as in [19]. Our inversion approach is orthogonal to the specific model and editing techniques, and we believe that these will be improved in the near future.

7. Conclusions

We have presented an approach to invert real images with corresponding captions into the latent space of a text-guided diffusion model while maintaining its powerful editing capabilities. Our two-step approach first uses DDIM inversion to compute a sequence of noisy codes, which roughly approximates the original image (with the given caption), and then uses this sequence as a fixed pivot to optimize the input null-text embedding. This fine optimization compensates for the inevitable reconstruction error caused by the classifier-free guidance component. Once the image-caption pair is accurately embedded in the output domain of the model, Prompt-to-Prompt editing can be instantly applied at inference time. By introducing two new technical concepts to text-guided diffusion models, pivotal inversion and null-text optimization, we were able to bridge the gap between reconstruction and editability. Our approach offers a surprisingly simple and compact means to reconstruct an arbitrary image, avoiding computationally intensive model tuning. We believe that null-text inversion paves the way for real-world use cases of intuitive, text-based image editing.
[Figure 8 plot: LPIPS vs. CLIPScore for SDEdit and SDEdit + Ours at different values of t_0 (marked on the curves); the bottom rows show, from left to right, the input image, our inversion, SDEdit, Ours + SDEdit, and Ours + P2P for the edits “Macaroni → cake on a table.” and “A baby wearing a blue shirt lying on the sofa → beach.”]

Figure 8. Our method improves SDEdit results. Top: we evaluate SDEdit with and without applying null-text inversion. In each measure, a different SDEdit parameter is used, i.e., a different percentage of diffusion steps is applied over the noisy image (marked on the curve). We measure both fidelity to the original image (via LPIPS, lower is better) and fidelity to the target text (via CLIP, higher is better). Bottom, from left to right: input image, null-text inversion, SDEdit, applying SDEdit after null-text inversion, and applying Prompt-to-Prompt after null-text inversion. As can be seen, our inversion significantly improves the fidelity to the original image when applied before SDEdit.

8. Acknowledgments

We thank Yuval Alaluf, Rinon Gal, Aleksander Holynski, Bryan Eric Feldman, Shlomi Fruchter and David Salesin for their valuable inputs that helped improve this work, and Bahjat Kawar, Shiran Zada and Oran Lang for providing us with their support for the Imagic [19] comparison. We also thank Jay Tenenbaum for the help with writing the background.

References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432-4441, 2019.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296-8305, 2020.
[3] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H. Bermano. HyperStyle: StyleGAN inversion with hypernetworks for real image editing, 2021.
[4] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[5] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208-18218, 2022.
[6] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. arXiv preprint arXiv:2204.02491, 2022.
[7] Amit H Bermano, Rinon Gal, Yuval Alaluf, Ron Mokady, Yotam Nitzan, Omer Tov, Oren Patashnik, and Daniel Cohen-Or. State-of-the-art in the architecture, methods and applications of StyleGAN. In Computer Graphics Forum, volume 41, pages 591-611. Wiley Online Library, 2022.
[8] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[9] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. DiffEdit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
[10] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967-1974, 2018.
[11] Katherine Crowson. VQGAN + CLIP, 2021. https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN.
[12] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022.
[13] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.
[14] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.
[15] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[16] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[18] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[19] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Hui-Tang Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv, abs/2210.09276, 2022.
[20] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426-2435, 2022.
[21] Gihyun Kwon and Jong Chul Ye. CLIPstyler: Image style transfer with a single text condition. arXiv preprint arXiv:2112.00374, 2021.
[22] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
[23] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[24] Ron Mokady, Amir Hertz, and Amit H Bermano. ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
[25] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[27] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[28] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: A StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.
[29] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 2022.
[30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[31] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[32] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[33] Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. KNN-Diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
[34] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256-2265. PMLR, 2015.
[35] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.
[36] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
[37] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[38] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for StyleGAN image manipulation. arXiv preprint arXiv:2102.02766, 2021.
[39] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. UniTune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
[40] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity GAN inversion for image attribute editing. arXiv, abs/2109.06590, 2021.
[41] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. GAN inversion: A survey, 2021.
[42] Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, and Minh N. Do. Semantic image inpainting with deep generative models, 2017.
[43] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586-595, 2018.
[44] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597-613. Springer, 2016.
[Figure 9 plots: (a) mean log-likelihood of z_T (scaled by 10^4) and (b) reconstruction PSNR, both as functions of the guidance scale w = 1, ..., 8.]

Figure 9. Setting the guidance scale for DDIM. We evaluate the DDIM inversion with different values of the guidance scale. On the left, we measure the log-likelihood of the latent vector z_T with respect to a multivariate normal distribution. This estimates the editability, as z_T should ideally be distributed normally, and deviation from this distribution reduces our ability to edit the image. On the right, we measure the reconstruction quality using PSNR. As can be seen, using a small guidance scale, such as w = 1, results in better editability and reconstruction.

Table 2. Inference time comparison. We measure both inversion and editing time for different methods. SDEdit is faster than ours, as an inversion is not employed by default, but it fails to preserve the unedited parts. Our method is more efficient than the rest of the baselines, as it provides accurate reconstruction with a faster inversion time, while also allowing multiple editing operations after a single inversion.

Method         Inversion   Editing   Multiple edits
VQGAN + CLIP   —           ∼1m       No
Text2LIVE      —           ∼9m       No
SDEdit         —           10s       Yes
Imagic         ∼5m         10s       No
Ours           ∼1m         10s       Yes

Appendix

A. Societal Impact

Our work suggests a new editing technique for manipulating real images using state-of-the-art text-to-image diffusion models. This modification of real photos might be exploited by malicious parties to produce fake content in order to spread disinformation. This is a known problem, common to all image editing techniques. However, research in identifying and preventing malicious editing is already making significant progress. We believe our work would contribute to this line of work, since we provide an analysis of the inversion and editing procedures using text-to-image diffusion models.

B. Ablation Study

DDIM Inversion. To validate our selection of the guidance scale parameter w = 1 during the DDIM inversion (see Algorithm 1, line 3, in the main text), we conduct the DDIM inversion with different values of w from 1 to 8, using the same data as in Section 4. For each inversion, we measure the log-likelihood of the resulting latent image z*_T ∈ R^{64×64×4} under the standard multivariate normal distribution. Intuitively, to achieve high editability we would like to maximize this term, since during training z*_T is distributed normally. The mean log-likelihood as a function of w is plotted in Fig. 9a. In addition, we measure the reconstruction with respect to the ground-truth input image using the PSNR metric (a minimal sketch of these two measurements is given below). As can be seen in Fig. 9b, increasing the value of w results in a less editable latent vector z*_T and a poorer initial reconstruction for our optimization, and therefore we use w = 1.

tion in order to produce semantic attention maps for these (Fig. 12 bottom). For example, to edit the print on the shirt, the source caption should include a "shirt with a drawing" term or a similar one.

Null-text optimization without pivotal inversion. Optimizing the null-text embedding fails without the efficient pivotal inversion. This is demonstrated in Figs. 13 and 14, where the non-pivotal null-text optimization produces low-quality reconstructions (2nd row).

Textual inversion with a pivot. Fig. 15 illustrates performing textual inversion around a pivot, i.e., similar to our pivotal inversion but optimizing the conditioned embedding. This results in a reconstruction comparable to ours, as demonstrated in Fig. 15 (bottom), but the editability is reduced. Analyzing the attention maps (Fig. 15, top), we observe that these are less accurate than ours. For example, using our null-text optimization, the attention referring to "goats" is much more local, and the attention referring to "desert" is more accurate. Consequently, editing the "desert" results in artifacts over the goats (Fig. 15, bottom).

C. Additional Results

Additional editing results of our method are provided in Fig. 10 and additional comparisons are provided in Fig. 16.

Inference time comparison. As can be seen in Tab. 2, SDEdit is the fastest, since an inversion is not employed, but as a result it fails to preserve the details of the original image. Our method is more efficient than Text2LIVE [6], VQGAN+CLIP [12] and Imagic [19], as it provides an accurate reconstruction in ∼1 minute, while also allowing multiple editing operations after a single inversion.
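A minimal sketch of the two measurements used in the guidance-scale analysis above (the editability proxy of Fig. 9a and the PSNR of Fig. 9b); the function names are illustrative only.

```python
import math
import torch

def standard_normal_log_likelihood(z_T):
    """Log-likelihood of a latent (e.g., z*_T of shape 64x64x4) under the
    standard multivariate normal, used as the editability proxy."""
    d = z_T.numel()
    return float(-0.5 * (z_T ** 2).sum() - 0.5 * d * math.log(2.0 * math.pi))

def psnr(reconstruction, ground_truth, max_val=1.0):
    """PSNR between a reconstruction and the ground-truth image."""
    mse = torch.mean((reconstruction - ground_truth) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```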
[Figure 10 panels: additional editing results for inputs captioned “A living room with a couch and pillows”, “Close up of a giraffe eating a bucket”, “A basket with apples on a chair”, “A bicycle is parking on the side of the street”, and “Two birds sitting on a branch”, with edits such as red velvet / leather / unicorn couch, fish / avocado cake / Lego cake, apples → puppies / cookies, cardboard basket, street → beach / forest, and snowy street.]

[Figure 11 plot: fidelity comparison (LPIPS) of Imagic (unofficial, α = 0.6, 0.7, 0.8, 0.9), SDEdit, and SDEdit + Ours.]

our method achieves better preservation of the original details (lower LPIPS). This is also supported by the visual results in Fig. 17, as Imagic struggles to accurately retain the background. Furthermore, we observe that Imagic is quite sensitive to the interpolation parameter α, as a high value reduces the fidelity to the image and a low value reduces the fidelity to the text, while a single value cannot be applied to all examples. In addition, the authors of Imagic applied their method on the same three images, presented in Fig. 17, using α = 0.93, 0.86, 1.08. This results in much better quality; however, the background is still not preserved.

* https://github.com/omerbt/Text2LIVE
† https://github.com/nerdyrodent/VQGAN-CLIP
‡ https://github.com/ermongroup/SDEdit
§ https://github.com/ShivamShrirao/diffusers/tree/main/examples/imagic

D. Implementation Details

In all of our experiments, we employ Stable Diffusion [30] using a DDIM sampler with the default hyperparameters: number of diffusion steps T = 50 and guidance scale w = 7.5. Stable Diffusion utilizes a pre-trained CLIP network as the language model ψ. The null text is tokenized into a start token, an end token, and 75 non-text padding tokens. Notice that the padding tokens are also used in CLIP and in the diffusion model, since both models do not use masking.

All inversion results, except the ones in the ablation study, were obtained using N = 10 (see Algorithm 1 in the main paper) and a learning rate of 0.01. We have used an early-stopping parameter of 1e-5, such that the total inversion for an input image and caption took 40s-120s on a single A100 GPU. Namely, for each timestamp t, we stop the optimization when the loss function value reaches 1e-5.
Global null-text inversion. The algorithm for optimizing only a single null-text embedding ∅ for all timestamps is presented in Algorithm 2. In this case, since the optimization of ∅ in a single timestamp affects all other timestamps, we change the order of the iterations of Algorithm 1. That is, we perform N iterations, in each of which we optimize ∅ for all the diffusion timestamps by iterating over t. As shown in Section 4, the convergence of this optimization is much slower than our final method. More specifically, we found that only after 7500 optimization steps (about 30 minutes) does the global null-text inversion accurately reconstruct the input image.

Algorithm 2: Global null-text inversion
1  Input: A source prompt P and input image I.
2  Output: Noise vector z̄_T and an optimized embedding ∅.
3  Set guidance scale w = 1;
4  Compute the intermediate results z*_T, ..., z*_0 of the DDIM inversion for image I;
5  Set guidance scale w = 7.5;
6  Initialize ∅ ← ψ("");
7  for j = 0, ..., N - 1 do
8      Set z̄_T ← z*_T;
9      for t = T, T - 1, ..., 1 do
10         ∅ ← ∅ - η ∇_∅ ||z*_{t-1} - z_{t-1}(z̄_t, ∅, C)||²_2;
           Set z̄_{t-1} ← z_{t-1}(z̄_t, ∅, C);
11     end
12 end
13 Return z̄_T, ∅

E. Additional Background - Diffusion Models

Denoising Diffusion Probabilistic Models (DDPM) [17, 34] are generative latent variable models that aim to model a distribution p_θ(x_0) that approximates the data distribution q(x_0) and is easy to sample from. DDPMs model a "forward process" in the space of x_0 from data to noise. This process is called "forward" due to its procedure progressing from x_0 to x_T. Note that this process is a Markov chain starting from x_0, where we gradually add noise to the data to generate the latent variables x_1, ..., x_T ∈ X. The sequence of latent variables therefore follows q(x_1, ..., x_T | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1}), where a step in the forward process is defined as a Gaussian transition q(x_t | x_{t-1}) := \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I), parameterized by a schedule β_0, ..., β_T ∈ (0, 1). When T is large enough, the last noise vector x_T nearly follows an isotropic Gaussian distribution.

An interesting property of the forward process is that one can express the latent variable x_t directly as the following linear combination of noise and x_0, without sampling intermediate latent vectors:

x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\, w, \qquad w \sim \mathcal{N}(0, I),   (5)

where α_t := \prod_{i=1}^{t} (1 - β_i).

To sample from the distribution q(x_0), we define the dual "reverse process" p(x_{t-1} | x_t) from isotropic Gaussian noise x_T to data by sampling the posteriors q(x_{t-1} | x_t). Since the intractable reverse process q(x_{t-1} | x_t) depends on the unknown data distribution q(x_0), we approximate it with a parameterized Gaussian transition network p_θ(x_{t-1} | x_t) := \mathcal{N}(x_{t-1} | \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)). The term µ_θ(x_t, t) can be replaced [17] by predicting the noise ε_θ(x_t, t) added to x_0 using Eq. (5). We use Bayes' theorem to approximate

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}}\, \varepsilon_\theta(x_t, t) \right).   (6)

Once we have a trained ε_θ(x_t, t), we can use the following sampling method:

x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).   (7)

We can control σ_t at each sampling stage, and in DDIMs [35] the sampling process can be made deterministic by setting σ_t = 0 in all the steps. The reverse process can finally be trained by solving the following optimization problem:

\min_\theta L(\theta) := \min_\theta \mathbb{E}_{x_0 \sim q(x_0),\, w \sim \mathcal{N}(0, I),\, t} \big\| w - \varepsilon_\theta(x_t, t) \big\|_2^2 ,

teaching the parameters θ to fit q(x_0) by maximizing a variational lower bound.

F. User-Study

An illustration of our user study is provided in Fig. 18.

G. Image Attribution

Girl in a field: https://unsplash.com/photos/1pCpWipo_jM
Birds on a branch: https://pixabay.com/photos/sparrows-birds-perched-sperlings-3434123/
Basket with apples: https://unsplash.com/photos/4Bj27zMqNSE
Bicycle: https://unsplash.com/photos/vZAk_n9Plfc
Child climbing: https://unsplash.com/photos/oLZViCDG-dk
Mountains: https://pixabay.com/photos/desert-mountains-sky-clouds-peru-4842264/
Giraffe: https://www.flickr.com/photos/tambako/30850708538/
Blue-haired woman in the forest: https://unsplash.com/photos/I3oRtzyBIFg
Dining table: https://cocodataset.org/#explore?id=360849
Elephants: https://cocodataset.org/#explore?id=345520
Man with a doughnut: https://cocodataset.org/#explore?id=360849
Cake on a table: https://cocodataset.org/#explore?id=413699
Piece of cake: https://cocodataset.org/#explore?id=133063
[Figure 12 panels: an input image captioned “A woman with a blue hair.” inverted and edited (“...smiling woman...”, “...sad woman...”, “...curly blue hair...”, “...green hair...”, woman → squirrel, woman → storm trooper); the same image captioned “A woman in the forest.” (“...forest at fall.”, “...forest at winter.”, forest → city / beach / water park / magic kingdom); and captioned “A woman wearing a shirt with a drawing.” (“...long sleeves shirt...”, “...turtle neck shirt...”, “...red shirt...”, “... drawing of kermit.”, “...of cookie monster.”, “...of inspector gadget.”), together with the corresponding cross-attention maps.]

Figure 12. Robustness to the input caption. We can invert an input image (top) using different input captions (first column). Naturally, the selection of the caption affects the editing abilities with Prompt-to-Prompt, as can be seen in the visualization of the cross-attention maps (bottom). Yet, our method is not particularly sensitive to the exact wording of the prompt.

Figure 13. Ablation study (input caption: “A black dinning room table sitting in a yellow dinning room.”). We show the inversion results for an increasing number of optimization iterations. Our method achieves high-quality reconstruction with fewer optimization steps.

Figure 14. Ablation study (input caption: “Two people riding elephants in dirty deep water.”). We show the inversion results for an increasing number of optimization iterations. Our method achieves high-quality reconstruction with fewer optimization steps.

[Figure 15: attention maps of text-embedding optimization + pivotal inversion, compared to ours.]

[Figure 16: additional comparisons of the input, our inversion, Text2LIVE, VQGAN+CLIP, SDEdit, and our editing.]

[Figure 17: comparison of the input, Imagic with Stable Diffusion (α = 0.6, 0.7, 0.8, 0.9), Imagic with Imagen, and ours.]

Figure 18. User study print screen.