2 DragDiffusion Inpainting
Figure 1. DragDiffusion greatly improves the applicability of interactive point-based editing. Given an input image, the user clicks handle points (red), target points (blue), and draws a mask specifying the editable region (brighter area). All results are obtained under the same user edit for fair comparisons. Project page: https://yujun-shi.github.io/projects/dragdiffusion.html.
Abstract

Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably, DragGAN developed by Pan et al. (2023) [33] is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However, due to its reliance on generative adversarial networks (GANs), its generality is limited by the capacity of pretrained GAN models. In this work, we extend this editing framework to diffusion models and propose a novel approach, DragDiffusion. By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. Unlike other diffusion-based editing methods that provide guidance on diffusion latents of multiple time steps, our approach achieves efficient yet accurate spatial control by optimizing the latent of only one time step. This novel design is motivated by our observation that UNet features at a specific time step provide sufficient semantic and geometric information to support drag-based editing. Moreover, we introduce two additional techniques, namely identity-preserving fine-tuning and reference-latent-control, to further preserve the identity of the original image. Lastly, we present a challenging benchmark dataset called DragBench—the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g., images with multiple objects, diverse object categories, various styles, etc.) demonstrate the versatility and generality of DragDiffusion. Code and the DragBench dataset: https://github.com/Yujun-Shi/DragDiffusion.

* Work done when interning with Song Bai.

1. Introduction

Image editing with generative models [9, 15, 22, 31, 34, 37] has attracted extensive attention recently. One landmark work is DragGAN [33], which enables interactive point-
based image editing, i.e., drag-based editing. Under this
framework, the user first clicks several pairs of handle and
target points on an image. Then, the model performs seman-
tically coherent editing on the image that moves the con-
tents of the handle points to the corresponding target points.
In addition, users can draw a mask to specify which region
of the image is editable while the rest remains unchanged.
Despite DragGAN's impressive editing results with pixel-level spatial control, the applicability of this method is limited by the inherent model capacity of generative adversarial networks (GANs) [12, 20, 21]. On the other hand, although large-scale text-to-image diffusion models [38, 42] have demonstrated strong capabilities to synthesize high-quality images across various domains, there are not many diffusion-based editing methods that can achieve precise spatial control. This is because most diffusion-based methods [15, 22, 31, 34] conduct editing by controlling the text embeddings, which restricts their applicability to editing high-level semantic contents or styles.

Figure 2. PCA visualization of UNet feature maps at different diffusion steps for two video frames. t = 50 corresponds to the full DDIM inversion, while t = 0 corresponds to the clean image. Notably, UNet features at one specific step (e.g., t = 35) provide sufficient semantic and geometric information (e.g., the shape and pose of the cat) for drag-based editing.

To bridge this gap, we propose DragDiffusion, the first interactive point-based image editing method with diffusion models [17, 38, 42, 46]. Empowered by large-scale pre-trained diffusion models [38, 42], DragDiffusion achieves accurate spatial control in image editing with significantly better generalizability (see Fig. 1).

Our approach focuses on optimizing diffusion latents to achieve drag-based editing, which is inspired by the fact that diffusion latents can accurately determine the spatial layout of the generated images [29]. In contrast to previous methods [3, 10, 34, 53], which apply gradient descent on latents of multiple diffusion steps, our approach focuses on optimizing the latent of one appropriately selected step to conveniently achieve the desired editing results. This novel design is motivated by the empirical observations presented in Fig. 2. Specifically, given two frames from a video simulating the original and the "dragged" images, we visualize the UNet feature maps of different diffusion steps using principal component analysis (PCA). Via this visualization, we find that there exists a single diffusion step (e.g., t = 35 in this case) such that the UNet feature maps at this step alone contain sufficient semantic and geometric information to support structure-oriented spatial control such as drag-based editing. Besides optimizing the diffusion latents, we further introduce two additional techniques to enhance identity preservation during the editing process, namely identity-preserving fine-tuning and reference-latent-control. An overview of our method is given in Fig. 3.

It would be ideal to immediately evaluate our method on well-established benchmark datasets. However, due to a lack of evaluation benchmarks for interactive point-based editing, it is difficult to rigorously study and corroborate the efficacy of our proposed approach. Therefore, to facilitate such evaluation, we present DragBench—the first benchmark dataset for drag-based editing. DragBench is a diverse collection comprising images spanning various object categories, indoor and outdoor scenes, realistic and aesthetic styles, etc. Each image in our dataset is accompanied by a set of "drag" instructions, which consists of one or more pairs of handle and target points as well as a mask specifying the editable region.

Through extensive qualitative and quantitative experiments on a variety of examples (including those on DragBench), we demonstrate the versatility and generality of our approach. In addition, our empirical findings corroborate the crucial role played by identity-preserving fine-tuning and reference-latent-control. Furthermore, a comprehensive ablation study is conducted to meticulously explore the influence of key factors, including the number of inversion steps of the latent, the number of identity-preserving fine-tuning steps, and the UNet feature maps.

Our contributions are summarized as follows: 1) we present a novel image editing method, DragDiffusion, the first to achieve interactive point-based editing with diffusion models; 2) we introduce DragBench, the first benchmark dataset to evaluate interactive point-based image editing methods; 3) comprehensive qualitative and quantitative evaluations demonstrate the versatility and generality of our DragDiffusion.

2. Related Work

Generative Image Editing. Given the initial successes of generative adversarial networks (GANs) in image generation [12, 20, 21], many previous image editing methods have been based on the GAN paradigm [2, 9, 14, 25, 33, 35, 44, 45, 49, 55, 56]. However, due to the limited model capacity of GANs and the difficulty of inverting real images into GAN latents [1, 8, 28, 37], the generality of these methods would inevitably be constrained.
Figure 3. Overview of DragDiffusion. Our approach consists of three steps: firstly, we conduct identity-preserving fine-tuning on the UNet of the diffusion model given the input image. Secondly, according to the user's dragging instruction, we optimize the latent obtained from DDIM inversion on the input image. Thirdly, we apply DDIM denoising guided by our reference-latent-control on ẑ_t to obtain the final editing result ẑ_0. Figure best viewed in color.
Recently, due to the impressive generation results from large-scale text-to-image diffusion models [38, 42], many diffusion-based image editing methods have been proposed [4, 6, 7, 15, 22, 26, 29, 30, 32, 34, 50]. Most of these methods aim to edit images by manipulating the prompts of the image. However, as many editing attempts are difficult to convey through text, the prompt-based paradigm usually alters the image's high-level semantics or styles, lacking the capability of achieving precise pixel-level spatial control. [10] is one of the early efforts in exploring better controllability on diffusion models beyond prompt-based image editing. In our work, we aim at enabling an even more versatile paradigm than the one studied in [10] with diffusion models—interactive point-based image editing.

Point-based editing. The framework of point-based editing [5, 19, 43] aims at manipulating images at a fine-grained level. Recently, several GAN-based methods have been proposed to enable such editing [9, 33, 52]. Specifically, DragGAN achieves impressive dragging-based manipulation with two simple ingredients: 1) optimizing latent codes to move the handle points towards their target locations and 2) a point tracking mechanism that keeps track of the handle points. However, its generality is constrained due to the limited capacity of GANs. FreeDrag [27] improves DragGAN by introducing a point-tracking-free paradigm. In this work, we extend the editing framework of DragGAN to diffusion models and showcase its generality over different domains. A concurrent work [32] also studies drag-based editing with diffusion models; differently, it relies on classifier guidance to transform the editing signal into gradients.

LoRA in Diffusion Models. Low Rank Adaptation (i.e., LoRA) [18] is a general technique to conduct parameter-efficient fine-tuning on large and deep networks. During LoRA fine-tuning, the original weights of the model are frozen, while trainable rank-decomposition matrices are injected into each layer. The core assumption of this strategy is that the model weights are primarily adapted within a low-rank subspace during fine-tuning. While LoRA was initially introduced for adapting language models to downstream tasks, recent efforts have illustrated its effectiveness when applied in conjunction with diffusion models [13, 41]. In this work, inspired by the promising results of using LoRA for image generation and editing [22, 40], we also implement our identity-preserving fine-tuning with LoRA.

3. Methodology

In this section, we formally present the proposed DragDiffusion approach. To commence, we introduce the preliminaries on diffusion models. Then, we elaborate on the three key stages of our approach as depicted in Fig. 3: 1) identity-preserving fine-tuning; 2) latent optimization according to the user-provided dragging instructions; 3) denoising the optimized latents guided by our reference-latent-control.

3.1. Preliminaries on Diffusion Models

Denoising diffusion probabilistic models (DDPM) [17, 46] constitute a family of latent generative models. Concerning a data distribution q(z), DDPM approximates q(z) as the marginal p_θ(z_0) of the joint distribution between Z_0 and a collection of latent random variables Z_{1:T}. Specifically,

$$p_\theta(z_0) = \int p_\theta(z_{0:T}) \, \mathrm{d}z_{1:T}, \qquad (1)$$
where p_θ(z_T) is a standard normal distribution and the transition kernels p_θ(z_{t−1}|z_t) of this Markov chain are all Gaussian conditioned on z_t. In our context, Z_0 corresponds to image samples given by users, and Z_t corresponds to the latent after t steps of the diffusion process.

[38] proposed the latent diffusion model (LDM), which maps data into a lower-dimensional space via a variational auto-encoder (VAE) [24] and models the distribution of the latent embeddings instead. Based on the framework of LDM, several powerful pretrained diffusion models have been released publicly, including the Stable Diffusion (SD) model (https://huggingface.co/stabilityai). In SD, the network responsible for modeling p_θ(z_{t−1}|z_t) is implemented as a UNet [39] that comprises multiple self-attention and cross-attention modules [51]. Applications in this paper are implemented based on the public Stable Diffusion model.

3.2. Identity-preserving Fine-tuning

Before editing a real image, we first conduct identity-preserving fine-tuning [18] on the diffusion model's UNet (see panel (1) of Fig. 3). This stage aims to ensure that the diffusion UNet encodes the features of the input image more accurately (than in the absence of this procedure), thus facilitating the consistency of the identity of the image throughout the editing process. This fine-tuning process is implemented with LoRA [18]. More formally, the objective function of the LoRA is

$$\mathcal{L}_{\mathrm{ft}}(z, \Delta\theta) = \mathbb{E}_{\epsilon, t}\Big[ \big\lVert \epsilon - \epsilon_{\theta+\Delta\theta}(\alpha_t z + \sigma_t \epsilon) \big\rVert_2^2 \Big], \qquad (2)$$

where θ and Δθ represent the UNet and LoRA parameters respectively, z is the real image, ε ∼ N(0, I) is the randomly sampled noise map, ε_{θ+Δθ}(·) is the noise map predicted by the LoRA-integrated UNet, and α_t and σ_t are parameters of the diffusion noise schedule at diffusion step t. The fine-tuning objective is optimized via gradient descent on Δθ.

Remarkably, we find that fine-tuning LoRA for merely 80 steps proves sufficient for our approach, which is in stark contrast to the 1000 steps required by tasks such as subject-driven image generation [13, 40]. This ensures that our identity-preserving fine-tuning process is extremely efficient, and only takes around 25 seconds to complete on an A100 GPU. We posit this efficiency is because our approach operates on the inverted noisy latent, which inherently preserves some information about the input real image. Consequently, our approach does not require lengthy fine-tuning to preserve the identity of the original image.
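To make the fine-tuning objective concrete, below is a minimal PyTorch sketch of Eq. (2). It assumes a noise-prediction UNet whose only trainable parameters are the injected LoRA matrices (`lora_params`), a pre-computed cumulative noise schedule `alphas_cumprod`, and the VAE latent `z` of the input image; the call signature `unet(noisy, t)` and all names are illustrative, not the authors' released code.

```python
# Minimal sketch of the identity-preserving fine-tuning objective in Eq. (2).
# `unet`, its call signature, and `lora_params` are illustrative assumptions.
import torch
import torch.nn.functional as F

def identity_preserving_finetune(unet, lora_params, z, alphas_cumprod,
                                 steps=80, lr=5e-4):
    optimizer = torch.optim.AdamW(lora_params, lr=lr)
    T = alphas_cumprod.shape[0]
    for _ in range(steps):
        # Sample a diffusion step t and a Gaussian noise map epsilon.
        t = torch.randint(0, T, (z.shape[0],), device=z.device)
        eps = torch.randn_like(z)
        alpha_t = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
        sigma_t = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
        noisy = alpha_t * z + sigma_t * eps      # alpha_t * z + sigma_t * eps
        eps_pred = unet(noisy, t)                # LoRA-integrated UNet prediction
        loss = F.mse_loss(eps_pred, eps)         # ||eps - eps_{theta+Delta_theta}(.)||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # gradient descent on Delta_theta only
    return unet
```

Only Δθ receives gradients here; the original UNet weights θ stay frozen, matching the LoRA setup described above.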
3.3. Diffusion Latent Optimization

After identity-preserving fine-tuning, we optimize the diffusion latent according to the user instruction (i.e., the handle and target points, and optionally a mask specifying the editable region) to achieve the desired interactive point-based editing (see panel (2) of Fig. 3).

To commence, we first apply a DDIM inversion [47] on the given real image to obtain a diffusion latent at a certain step t (i.e., z_t). This diffusion latent serves as the initial value for our latent optimization process. Then, following a similar spirit to [33], the latent optimization process consists of two steps to be implemented consecutively. These two steps, namely motion supervision and point tracking, are executed repeatedly until either all handle points have moved to the targets or the maximum number of iterations has been reached. Next, we describe these two steps in detail.

Motion Supervision: We denote the n handle points at the k-th motion supervision iteration as {h_i^k = (x_i^k, y_i^k) : i = 1, ..., n} and their corresponding target points as {g_i = (x̃_i, ỹ_i) : i = 1, ..., n}. The input image is denoted as z_0; the t-th step latent (i.e., the result of t-th step DDIM inversion) is denoted as z_t. We denote the UNet output feature maps used for motion supervision as F(z_t), and the feature vector at pixel location h_i^k as F_{h_i^k}(z_t). Also, we denote the square patch centered around h_i^k as Ω(h_i^k, r_1) = {(x, y) : |x − x_i^k| ≤ r_1, |y − y_i^k| ≤ r_1}. Then, the motion supervision loss at the k-th iteration is defined as:

$$\mathcal{L}_{\mathrm{ms}}(\hat{z}_t^k) = \sum_{i=1}^{n} \sum_{q \in \Omega(h_i^k, r_1)} \big\lVert F_{q+d_i}(\hat{z}_t^k) - \mathrm{sg}\big(F_q(\hat{z}_t^k)\big) \big\rVert_1 + \lambda \, \big\lVert \big(\hat{z}_{t-1}^{k} - \hat{z}_{t-1}^{0}\big) \odot (1 - M) \big\rVert_1, \qquad (3)$$

where ẑ_t^k is the t-th step latent after the k-th update, sg(·) is the stop-gradient operator (i.e., the gradient will not be back-propagated for the term sg(F_q(ẑ_t^k))), d_i = (g_i − h_i^k)/‖g_i − h_i^k‖_2 is the normalized vector pointing from h_i^k to g_i, M is the binary mask specified by the user, and F_{q+d_i}(ẑ_t^k) is obtained via bilinear interpolation as the elements of q + d_i may not be integers. In each iteration, ẑ_t^k is updated by taking one gradient descent step to minimize L_ms:

$$\hat{z}_t^{k+1} = \hat{z}_t^{k} - \eta \cdot \frac{\partial \mathcal{L}_{\mathrm{ms}}(\hat{z}_t^{k})}{\partial \hat{z}_t^{k}}, \qquad (4)$$

where η is the learning rate for latent optimization.

Note that for the second term in Eqn. (3), which encourages the unmasked area to remain unchanged, we are working with the diffusion latent instead of the UNet features. Specifically, given ẑ_t^k, we first apply one step of DDIM denoising to obtain ẑ_{t−1}^k, then we regularize the unmasked region of ẑ_{t−1}^k to be the same as ẑ_{t−1}^0 (i.e., z_{t−1}).

Point Tracking: Since the motion supervision updates ẑ_t^k, the positions of the handle points may also change. Therefore, we need to perform point tracking to update the handle points after each motion supervision step. To achieve this goal, we use the UNet feature maps F(ẑ_t^{k+1}) and F(z_t) to track the new handle points. Specifically, we update each of the handle points h_i^k with a nearest neighbor search within the square patch Ω(h_i^k, r_2) = {(x, y) : |x − x_i^k| ≤ r_2, |y − y_i^k| ≤ r_2} as follows:

$$h_i^{k+1} = \underset{q \in \Omega(h_i^k, r_2)}{\arg\min} \; \big\lVert F_q(\hat{z}_t^{k+1}) - F_{h_i^0}(z_t) \big\rVert_1. \qquad (5)$$
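The following sketch condenses one round of motion supervision (Eqs. 3 and 4) and point tracking (Eq. 5) into runnable pseudocode. It assumes helper callables `unet_features(latent, t)` (returning the feature map F(·) used for supervision) and `ddim_denoise_step(latent, t)` (one DDIM step), both differentiable with respect to the latent; these, together with the bilinear-sampling helpers defined below, are illustrative stand-ins rather than the released implementation.

```python
# Condensed sketch of the latent-optimization loop (Eqs. 3-5); all helpers are
# illustrative stand-ins for the described procedure, not the released code.
import torch
import torch.nn.functional as F

def patch_coords(center, r):
    # Integer pixel locations in the square patch Omega(center, r).
    cy, cx = center.round().long().tolist()
    return [torch.tensor([cy + dy, cx + dx], dtype=torch.float32, device=center.device)
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]

def bilinear(feat, p):
    # Sample the (1, C, H, W) feature map at the sub-pixel location p = (y, x).
    H, W = feat.shape[-2:]
    x = 2.0 * p[1] / (W - 1) - 1.0
    y = 2.0 * p[0] / (H - 1) - 1.0
    grid = torch.stack([x, y]).view(1, 1, 1, 2).to(feat.dtype).to(feat.device)
    return F.grid_sample(feat, grid, align_corners=True).view(-1)

def drag_optimize(z_t, handle_pts, target_pts, mask, unet_features,
                  ddim_denoise_step, t, n_iters=80, r1=1, r2=3, lam=0.1, lr=0.01):
    z_hat = z_t.clone().requires_grad_(True)          # optimized latent, initially z_t
    optimizer = torch.optim.Adam([z_hat], lr=lr)
    feat_ref = unet_features(z_t, t).detach()         # F(z_t), used for point tracking
    z_prev_ref = ddim_denoise_step(z_t, t).detach()   # reference z_{t-1} for the mask term
    handles = [h.clone() for h in handle_pts]         # current handle points h_i^k

    for _ in range(n_iters):
        feat = unet_features(z_hat, t)
        loss = 0.0
        for h, g in zip(handles, target_pts):
            d = (g - h) / (g - h).norm()              # normalized drag direction d_i
            for q in patch_coords(h, r1):             # q in Omega(h_i^k, r1)
                # First term of Eq. (3): pull features at q toward q + d_i.
                loss = loss + (bilinear(feat, q + d)
                               - bilinear(feat, q).detach()).abs().sum()
        # Second term of Eq. (3): keep the unmasked region of z_{t-1} unchanged.
        z_prev = ddim_denoise_step(z_hat, t)
        loss = loss + lam * ((z_prev - z_prev_ref) * (1 - mask)).abs().sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                              # Eq. (4)

        # Point tracking, Eq. (5): nearest-neighbour search in feature space.
        feat_new = unet_features(z_hat, t).detach()
        handles = [min(patch_coords(h, r2),
                       key=lambda q: float((bilinear(feat_new, q)
                                            - bilinear(feat_ref, h0)).abs().sum()))
                   for h, h0 in zip(handles, handle_pts)]
        if all(float((h - g).norm()) < 1.0 for h, g in zip(handles, target_pts)):
            break                                     # all handles reached their targets
    return z_hat.detach()
```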
3.4. Reference-latent-control

After we have completed the optimization of the diffusion latents, we then denoise the optimized latents to obtain the final editing results. However, we find that naïvely applying DDIM denoising on the optimized latents still occasionally leads to undesired identity shift or degradation in quality compared to the original image. We posit that this issue arises due to the absence of adequate guidance from the original image during the denoising process.

To mitigate this problem, we draw inspiration from [7] and propose to leverage the property of self-attention modules to steer the denoising process, thereby boosting coherence between the original image and the editing results. In particular, as illustrated in panel (3) of Fig. 3, given the denoising process of both the original latent z_t and the optimized latent ẑ_t, we use the process of z_t to guide the process of ẑ_t. More specifically, during the forward propagation of the UNet's self-attention modules in the denoising process, we replace the key and value vectors generated from ẑ_t with the ones generated from z_t. With this simple replacement technique, the query vectors generated from ẑ_t will be directed to query the correlated contents and texture of z_t. This leads to the denoising results of ẑ_t (i.e., ẑ_0) being more coherent with the denoising results of z_t (i.e., z_0). In this way, reference-latent-control substantially improves the consistency between the original and the edited images.
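A minimal sketch of the key/value replacement underlying reference-latent-control is given below. It shows a single self-attention call in which queries come from the edited branch while keys and values come from the reference branch; `to_q`, `to_k`, and `to_v` stand for the usual linear projections of a self-attention module and are assumptions about the surrounding code, not the Stable Diffusion API.

```python
# Sketch of self-attention with reference keys/values (reference-latent-control).
# `to_q`, `to_k`, `to_v` are the projections of one self-attention layer (assumed).
import torch

def self_attention_with_reference(x_edit, x_ref, to_q, to_k, to_v, num_heads=8):
    # x_edit, x_ref: (B, N, C) token features from the edited / original branch.
    q = to_q(x_edit)                  # queries from the edited latent
    k = to_k(x_ref)                   # keys from the original latent
    v = to_v(x_ref)                   # values from the original latent
    B, N, C = q.shape
    h, d = num_heads, C // num_heads
    q = q.view(B, N, h, d).transpose(1, 2)
    k = k.view(B, N, h, d).transpose(1, 2)
    v = v.view(B, N, h, d).transpose(1, 2)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, N, C)
    return out
```

During editing, this replacement is applied in the upsampling blocks of the UNet at every denoising step (see Sec. 4.1), so only the queries differ between the two branches.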
4. Experiments

4.1. Implementation Details

In all our experiments, unless stated otherwise, we adopt Stable Diffusion 1.5 [38] as our diffusion model. During the identity-preserving fine-tuning, we inject LoRA into the projection matrices of query, key and value in all of the attention modules. We set the rank of the LoRA to 16. We fine-tune the LoRA using the AdamW [23] optimizer with a learning rate of 5 × 10−4 and a batch size of 4 for 80 steps.

During the latent optimization stage, we schedule 50 steps for DDIM and optimize the diffusion latent at the 35-th step unless specified otherwise. When editing real images, we do not apply classifier-free guidance (CFG) [16] in either the DDIM inversion or the DDIM denoising process. This is because CFG tends to amplify numerical errors, which is not ideal for performing DDIM inversion [31]. We use the Adam optimizer with a learning rate of 0.01 to optimize the latent. The maximum number of optimization steps is set to 80. The hyperparameters r_1 in Eqn. (3) and r_2 in Eqn. (5) are tuned to be 1 and 3, respectively. λ in Eqn. (3) is set to 0.1 by default, but the user may increase λ if the unmasked region changes more than desired.

Finally, we apply our reference-latent-control in the upsampling blocks of the diffusion UNet at all denoising steps when generating the editing results. The execution time for each component is detailed in Appendix D.
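For convenience, the hyperparameters listed in this subsection can be collected into a single configuration object, as in the sketch below; the key names are invented for illustration, while the values are the defaults stated above.

```python
# Default hyperparameters gathered from this subsection; key names are invented
# for readability, only the values come from the text.
DRAGDIFFUSION_DEFAULTS = {
    "base_model": "Stable Diffusion 1.5",
    "lora_rank": 16,            # LoRA injected into the q/k/v projection matrices
    "lora_lr": 5e-4,            # AdamW, batch size 4
    "lora_steps": 80,
    "ddim_steps": 50,           # DDIM schedule length
    "optimize_at_step": 35,     # diffusion step t used for latent optimization
    "use_cfg": False,           # no classifier-free guidance for inversion/denoising
    "latent_lr": 0.01,          # Adam on the diffusion latent
    "max_drag_iters": 80,       # maximum number of optimization steps
    "r1": 1,                    # patch radius in Eq. (3)
    "r2": 3,                    # patch radius in Eq. (5)
    "lambda_mask": 0.1,         # weight of the unmasked-region term in Eq. (3)
}
```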
4.2. DragBench and Evaluation Metrics

Since interactive point-based image editing is a recently introduced paradigm, there is an absence of dedicated evaluation benchmarks for this task, making it challenging to comprehensively study the effectiveness of our proposed approach. To address the need for systematic evaluation, we introduce DragBench, the first benchmark dataset tailored for drag-based editing. DragBench is a diverse compilation encompassing various types of images. Details and examples of our dataset are given in Appendix A. Each image within our dataset is accompanied by a set of dragging instructions, comprising one or more pairs of handle and target points, along with a mask indicating the editable region. We hope future research on this task can benefit from DragBench.

In this work, we utilize the following two metrics for quantitative evaluation: Image Fidelity (IF) [22] and Mean Distance (MD) [33]. IF, the first metric, quantifies the similarity between the original and edited images. It is calculated by subtracting the mean LPIPS [54] over all pairs of original and edited images from 1. The second metric, MD, assesses how well the approach moves the semantic contents to the target points. To compute the MD, we first employ DIFT [48] to identify points in the edited images corresponding to the handle points in the original image. These identified points are considered to be the final handle points post-editing. MD is subsequently computed as the mean Euclidean distance between the positions of all target points and their corresponding final handle points. MD is averaged over all pairs of handle and target points in the dataset. An optimal "drag" approach ideally achieves both a low MD (indicating effective "dragging") and a high IF (reflecting robust identity preservation).
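A sketch of how the two metrics could be computed is shown below, assuming the original and edited images are tensors of shape (1, 3, H, W) normalized to [−1, 1] (as expected by the lpips package) and that the final handle points have already been localized in the edited images; the DIFT correspondence step is abstracted away here.

```python
# Sketch of the two evaluation metrics: Image Fidelity (IF) and Mean Distance (MD).
import torch
import lpips  # pip install lpips

def image_fidelity(originals, edits):
    # IF = 1 - mean LPIPS over all (original, edited) pairs.
    lpips_net = lpips.LPIPS(net="alex")
    scores = [lpips_net(o, e).item() for o, e in zip(originals, edits)]
    return 1.0 - sum(scores) / len(scores)

def mean_distance(target_points, final_handle_points):
    # Mean Euclidean distance between targets and the final (tracked) handles.
    dists = [torch.dist(t.float(), h.float()).item()
             for t, h in zip(target_points, final_handle_points)]
    return sum(dists) / len(dists)
```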
Figure 4. Comparisons between DragGAN and DragDiffusion. All results are obtained under the same user edit for fair comparisons.

Figure 5. Editing results on diffusion-generated images with (a) Stable-Diffusion-1.5, (b) Counterfeit-V2.5, (c) Majicmix Realistic, (d) Interior Design Supermix.
Figure 6. Ablating the number of inversion steps t. Effective results are obtained when t ∈ [30, 40].

4.3. Qualitative Evaluation

In this section, we first compare our approach with DragGAN on real images. We employ SD-1.5 for our approach when editing real images. All input images and the user edit instructions are from our DragBench dataset. Results are given in Fig. 4. As illustrated in the figure, when dealing with real images from a variety of domains, DragGAN often struggles due to GAN models' limited capacity. On the other hand, our DragDiffusion can convincingly deliver reasonable editing results. More importantly, besides achieving similar pose manipulation and local deformation as in DragGAN [33], our approach even enables more types of editing such as content filling. An example is given in Fig. 4 (a), where we fill the grassland with the pool using drag-based editing. This further validates the enhanced versatility of our approach. More qualitative comparisons are provided in Appendix F.

Next, to show the generality of our approach, we perform drag-based editing on diffusion-generated images across a spectrum of variants of SD-1.5, including SD-1.5 itself, Counterfeit-V2.5, Majicmix Realistic, and Interior Design Supermix. Results are shown in Fig. 5. These results validate our approach's ability to smoothly work with various pretrained diffusion models. Moreover, these results also illustrate our approach's ability to deal with drag instructions of different magnitudes (e.g., small-magnitude edits such as the left-most image in Fig. 5 (d) and large-magnitude edits such as the left-most image in Fig. 5 (c)). Additional results with more diffusion models and different resolutions can be found in Appendix F.

4.4. Quantitative Analysis

In this section, we conduct a rigorous quantitative evaluation to assess the performance of our approach. We begin by comparing DragDiffusion with the baseline method DragGAN. As each StyleGAN [21] model utilized in [33] is specifically designed for a particular image class, we employ an ensemble strategy to evaluate DragGAN. This strategy involves assigning a text description to characterize the images generated by each StyleGAN model. Before editing each image, we compute the CLIP similarity [36] between the image and each of the text descriptions associated with the GAN models. The GAN model that yields the highest CLIP similarity is selected for the editing task.

Furthermore, to validate the effectiveness of each component of our approach, we evaluate DragDiffusion in the following two configurations: one without identity-preserving fine-tuning and the other without reference-latent-control. We perform our empirical studies on the DragBench dataset, and the Image Fidelity (IF) and Mean Distance (MD) of each configuration mentioned above are reported in Fig. 8. All results are averaged over the DragBench dataset. In this figure, the x-axis represents MD and the y-axis represents IF, which means that a method with better results should be located at the upper-left corner of the coordinate plane. The results presented in this figure clearly demonstrate that our DragDiffusion significantly outperforms the DragGAN baseline in terms of both IF and MD. Furthermore, we observe that DragDiffusion without identity-preserving fine-tuning experiences a catastrophic increase in MD, whereas DragDiffusion without reference-latent-control primarily encounters a decrease in IF. Visualizations of the effects of identity-preserving fine-tuning and reference-latent-control are given in Fig. 9, which corroborates our quantitative results.

4.5. Ablation on the Number of Inversion Steps

Next, we conduct an ablation study to elucidate the impact of varying t (i.e., the number of inversion steps) during the latent optimization stage of DragDiffusion. We set t to 10, 20, 30, 40, and 50 steps and run our approach on DragBench to obtain the editing results (t = 50 corresponds to the pure noisy latent). We evaluate Image Fidelity (IF) and Mean Distance (MD) for each t value in Fig. 7(a). All metrics are averaged over the DragBench dataset.

In terms of IF, we observe a monotonic decrease as t increases. This trend can be attributed to the stronger flexibility of the diffusion latent as more steps are inverted. As for MD, it initially decreases and then increases with higher t values. This behavior highlights the presence of a critical range of t values for effective editing (t ∈ [30, 40] in our figure). When t is too small, the diffusion latent lacks the necessary flexibility for substantial changes, posing challenges in performing reasonable edits. Conversely, overly large t values result in a diffusion latent that is unstable for editing, leading to difficulties in preserving the original image's identity. Given these results, we chose t = 35 as our default setting, as it achieves the lowest MD while maintaining a decent IF. Qualitative visualizations that corroborate our numerical evaluation are provided in Fig. 6.

4.6. Ablation Study on the Number of Identity-preserving Fine-tuning Steps

We run our approach on DragBench under 0, 20, 40, 60, 80, and 100 identity-preserving fine-tuning steps, respectively (0 being no fine-tuning). The outcomes are assessed using IF and MD, and the results are presented in Fig. 7 (b).
Figure 7. Ablation study on (a) the number of inversion steps t of the diffusion latent; (b) the number of identity-preserving fine-tuning steps; (c) the Block No. of UNet feature maps. Mean Distance (↓) and Image Fidelity (↑) are reported. Results are produced on DragBench.

Figure 8. Quantitative analysis on DragGAN, DragDiffusion, and DragDiffusion's variants without certain components. Image Fidelity (↑) and Mean Distance (↓) are reported. Results are produced on DragBench. The approach with better results should be located at the upper-left corner of the coordinate plane.

4.7. Ablation Study on the UNet Feature Maps

Finally, we study the effect of using different blocks of UNet feature maps to supervise our latent optimization. We run our approach on the DragBench dataset with the feature maps output by 4 different upsampling blocks of the UNet decoder, respectively. The outcomes are assessed with IF and MD, and are shown in Fig. 7(c). As can be seen, with deeper blocks of UNet features, IF consistently increases, while MD first decreases and then increases. This trend is because feature maps of lower blocks contain coarser semantic information, while higher blocks contain lower-level texture information [11, 50]. Hence, the feature maps of lower blocks (e.g., block No. 1) lack fine-grained information for accurate spatial control, whereas those of higher blocks (e.g., block No. 4) lack semantic and geometric information to drive the drag-based editing. Our results indicate that the feature maps produced by the third block of the UNet decoder demonstrate the best performance, exhibiting the lowest MD and a relatively high IF. Visualizations that corroborate our quantitative evaluation are presented in Appendix H.
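To experiment with which decoder block supervises the optimization, one can grab a block's output with a standard forward hook, as in the sketch below. The `unet.up_blocks` attribute and the simplified call `unet(latent, t)` are assumptions about the model object rather than a documented interface.

```python
# Sketch: grab the feature map of one UNet decoder (upsampling) block with a
# forward hook, e.g. to inspect which block is used as F(.) in Eqs. (3) and (5).
import torch

def get_block_features(unet, latent, t, block_no=3):
    feats = {}
    hook = unet.up_blocks[block_no - 1].register_forward_hook(
        lambda module, inputs, output: feats.update(out=output))
    try:
        with torch.no_grad():
            unet(latent, t)          # one forward pass at diffusion step t
    finally:
        hook.remove()                # always detach the hook
    return feats["out"]              # feature map from the chosen upsampling block
```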
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4432–4441, 2019.
[2] Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (ToG), 40(3):1–21, 2021.
[3] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 843–852, 2023.
[4] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723. Springer, 2022.
[5] Thaddeus Beier and Shawn Neely. Feature-based image metamorphosis. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 529–536. 2023.
[6] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
[7] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465, 2023.
[8] Antonia Creswell and Anil Anthony Bharath. Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems, 30(7):1967–1974, 2018.
[9] Yuki Endo. User-controllable latent transformer for stylegan image layout editing. arXiv preprint arXiv:2208.12408, 2022.
[10] Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. arXiv preprint arXiv:2306.00986, 2023.
[11] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014.
[13] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292, 2023.
[14] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. Advances in Neural Information Processing Systems, 33:9841–9850, 2020.
[15] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[18] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
[19] Takeo Igarashi, Tomer Moscovich, and John F Hughes. As-rigid-as-possible shape manipulation. ACM Transactions on Graphics (TOG), 24(3):1134–1141, 2005.
[20] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[21] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[22] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
[23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[25] Thomas Leimkühler and George Drettakis. Freestylegan: Free-view editable portrait rendering with the camera manifold. arXiv preprint arXiv:2109.09378, 2021.
[26] Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. Magicmix: Semantic mixing with diffusion models. arXiv preprint arXiv:2210.16056, 2022.
[27] Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, and Yi Jin. Freedrag: Point tracking is not you need for interactive point-based image editing. arXiv preprint arXiv:2307.04684, 2023.
[28] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
[29] Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. Guided image synthesis via initial image editing in diffusion model. arXiv preprint arXiv:2305.03382, 2023.
[30] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
[31] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6038–6047, 2023.
[32] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
[33] Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your GAN: Interactive point-based manipulation on the generative image manifold. arXiv preprint arXiv:2305.10973, 2023.
[34] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. arXiv preprint arXiv:2302.03027, 2023.
[35] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2085–2094, 2021.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[37] Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 42(1):1–13, 2022.
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[39] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[40] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[41] Simo Ryu. Low-rank adaptation for fast text-to-image diffusion fine-tuning. https://github.com/cloneofsimo/lora, 2022.
[42] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[43] Scott Schaefer, Travis McPhail, and Joe Warren. Image deformation using moving least squares. In ACM SIGGRAPH 2006 Papers, pages 533–540. 2006.
[44] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1532–1540, 2021.
[45] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9243–9252, 2020.
[46] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[47] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[48] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. arXiv preprint arXiv:2306.03881, 2023.
[49] Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhofer, and Christian Theobalt. Stylerig: Rigging stylegan for 3d control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6142–6151, 2020.
[50] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[52] Sheng-Yu Wang, David Bau, and Jun-Yan Zhu. Rewriting geometric rules of a gan. ACM Transactions on Graphics (TOG), 41(4):1–16, 2022.
[53] Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. Freedom: Training-free energy-guided conditional diffusion model. arXiv preprint arXiv:2303.09833, 2023.
[54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[55] Jiapeng Zhu, Ceyuan Yang, Yujun Shen, Zifan Shi, Deli Zhao, and Qifeng Chen. Linkgan: Linking gan latents to pixels for controllable image synthesis. arXiv preprint arXiv:2301.04604, 2023.
[56] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 597–613. Springer, 2016.