Adversarial Diffusion Distillation
Figure 1. Generating high-fidelity 512² images in a single step. All samples are generated with a single U-Net evaluation trained with adversarial diffusion distillation (ADD).
1. Introduction
Our approach is conceptually simple: We propose Adversarial Diffusion Distillation (ADD), a general approach that reduces the number of inference steps of a pre-trained diffusion model to 1–4 sampling steps while maintaining high sampling fidelity and potentially further improving the overall performance of the model. To this end, we introduce a combination of two training objectives: (i) an adversarial loss and (ii) a distillation loss that corresponds to score distillation sampling (SDS) [51]. The adversarial loss forces the model to directly generate samples that lie on the manifold of real images at each forward pass, avoiding blurriness and other artifacts typically observed in other distillation methods [43]. The distillation loss uses another pretrained (and fixed) DM as a teacher to effectively utilize the extensive knowledge of the pretrained DM and preserve the strong compositionality observed in large DMs. During inference, our approach does not use classifier-free guidance [19], further reducing memory requirements.
We retain the model's ability to improve results through iterative refinement, which is an advantage over previous one-step GAN-based approaches [59].

Our contributions can be summarized as follows:
• We introduce ADD, a method for turning pretrained diffusion models into high-fidelity, real-time image generators using only 1–4 sampling steps.
• Our method uses a novel combination of adversarial training and score distillation, for which we carefully ablate several design choices.
• ADD significantly outperforms strong baselines such as LCM, LCM-XL [38] and single-step GANs [59], and is able to handle complex image compositions while maintaining high image realism at only a single inference step.
• Using four sampling steps, ADD-XL outperforms its teacher model SDXL-Base at a resolution of 512² px.

Figure 2. Adversarial Diffusion Distillation. The ADD-student is trained as a denoiser that receives diffused input images x_s and outputs samples x̂_θ(x_s, s), optimizing two objectives: a) adversarial loss: the model aims to fool a discriminator which is trained to distinguish the generated samples x̂_θ from real images x_0; b) distillation loss: the model is trained to match the denoised targets x̂_ψ of a frozen DM teacher.

2. Background

While diffusion models achieve remarkable performance in synthesizing and editing high-resolution images [3, 53, 54] and videos [4, 21], their iterative nature hinders real-time application. Latent diffusion models [54] attempt to solve this problem by representing images in a more computationally feasible latent space [11], but they still rely on the iterative application of large models with billions of parameters. In addition to utilizing faster samplers for diffusion models [8, 37, 64, 74], there is a growing body of research on model distillation, such as progressive distillation [56] and guidance distillation [43]. These approaches reduce the number of iterative sampling steps to 4–8, but may significantly lower the original performance. Furthermore, they require an iterative training process. Consistency models [66] address the latter issue by enforcing a consistency regularization on the ODE trajectory and demonstrate strong performance for pixel-based models in the few-shot setting. LCMs [38] focus on distilling latent diffusion models and achieve impressive performance at 4 sampling steps. Recently, LCM-LoRA [40] introduced low-rank adaptation [22] training for efficiently learning LCM modules, which can be plugged into different checkpoints for SD and SDXL [50, 54]. InstaFlow [36] proposes to use Rectified Flows [35] to facilitate a better distillation process.

All of these methods share common flaws: samples synthesized in four steps often look blurry and exhibit noticeable artifacts. At fewer sampling steps, this problem is further amplified. GANs [14] can also be trained as standalone single-step models for text-to-image synthesis [25, 59]. Their sampling speed is impressive, yet their performance lags behind diffusion-based models. In part, this can be attributed to the finely balanced GAN-specific architectures necessary for stable training of the adversarial objective. Scaling these models and integrating advances in neural network architectures without disturbing that balance is notoriously challenging. Additionally, current state-of-the-art text-to-image GANs do not have a mechanism like classifier-free guidance available, which is crucial for DMs at scale.
Prompts: "A cinematic shot of a professor sloth wearing a tuxedo at a BBQ party." / "A high-quality photo of a confused bear in calculus class. The bear is wearing a party hat and steampunk armor."

Figure 3. Qualitative comparison to state-of-the-art fast samplers. Single-step samples from our ADD-XL (top) and LCM-XL [40], our custom StyleGAN-T [59] baseline, InstaFlow [36], and MUSE. For MUSE, we use the OpenMUSE implementation and default inference settings with 16 sampling steps. For LCM-XL, we sample with 1, 2, and 4 steps. Our model outperforms all other few-step samplers in a single step.

Score Distillation Sampling [51], also known as Score Jacobian Chaining [68], is a recently proposed method developed to distill the knowledge of foundational T2I models into 3D synthesis models. While the majority of SDS-based works [45, 51, 68, 69] use SDS in the context of per-scene optimization for 3D objects, the approach has also been applied to text-to-3D-video synthesis [62] and in the context of image editing [16].

Recently, the authors of [13] have shown a strong relationship between score-based models and GANs and propose Score GANs, which are trained using score-based diffusion flows from a DM instead of a discriminator. Similarly, Diff-Instruct [42], a method which generalizes SDS, enables distilling a pretrained diffusion model into a generator without a discriminator.

Conversely, there are also approaches which aim to improve the diffusion process using adversarial training. For faster sampling, Denoising Diffusion GANs [70] are introduced as a method to enable sampling with few steps. To improve quality, a discriminator loss is added to the score matching objective in Adversarial Score Matching [24] and to the consistency objective of CTM [29].

Our method combines adversarial training and score distillation in a hybrid objective to address the issues in current top-performing few-step generative models.
Prompts: "A brain riding a rocketship heading towards the moon." / "A bald eagle made of chocolate powder, mango, and whipped cream" / "A blue colored dog."

Figure 4. Qualitative effect of sampling steps. We show qualitative examples when sampling ADD-XL with 1, 2, and 4 steps. Single-step samples are often already of high quality, but increasing the number of steps can further improve the consistency (e.g. second prompt, first column) and attention to detail (e.g. second prompt, second column). The seeds are constant within columns, and we see that the general layout is preserved across sampling steps, allowing for fast exploration of outputs while retaining the possibility to refine.

3. Method

Our goal is to generate high-fidelity samples in as few sampling steps as possible, while matching the quality of state-of-the-art models [7, 50, 53, 55]. The adversarial objective [14, 60] naturally lends itself to fast generation, as it trains a model that outputs samples on the image manifold in a single forward step. However, attempts at scaling GANs to large datasets [58, 59] observed that it is critical to not solely rely on the discriminator, but to also employ a pretrained classifier or CLIP network for improving text alignment. As remarked in [59], overly utilizing discriminative networks introduces artifacts and image quality suffers. Instead, we utilize the gradient of a pretrained diffusion model via a score distillation objective to improve text alignment and sample quality. Furthermore, instead of training from scratch, we initialize our model with pretrained diffusion model weights; pretraining the generator network is known to significantly improve training with an adversarial loss [15]. Lastly, instead of utilizing a decoder-only architecture as used for GAN training [26, 27], we adapt a standard diffusion model framework. This setup naturally enables iterative refinement.

3.1. Training Procedure

Our training procedure is outlined in Fig. 2 and involves three networks: the ADD-student, which is initialized from a pretrained UNet-DM with weights θ, a discriminator with trainable weights ϕ, and a DM teacher with frozen weights ψ. During training, the ADD-student generates samples x̂_θ(x_s, s) from noisy data x_s. The noised data points are produced from a dataset of real images x_0 via a forward diffusion process x_s = α_s x_0 + σ_s ϵ. In our experiments, we use the same coefficients α_s and σ_s as the student DM and sample s uniformly from a set T_student = {τ_1, ..., τ_n} of N chosen student timesteps. In practice, we choose N = 4. Importantly, we set τ_n = 1000 and enforce zero-terminal SNR [33] during training, as the model needs to start from pure noise during inference.

For the adversarial objective, the generated samples x̂_θ and real images x_0 are passed to the discriminator, which aims to distinguish between them. The design of the discriminator and the adversarial loss are described in detail in Sec. 3.2. To distill knowledge from the DM teacher, we diffuse student samples x̂_θ with the teacher's forward process to x̂_θ,t and use the teacher's denoising prediction x̂_ψ(x̂_θ,t, t) as a reconstruction target for the distillation loss L_distill; see Section 3.3. Thus, the overall objective is

    L = L^G_adv(x̂_θ(x_s, s), ϕ) + λ L_distill(x̂_θ(x_s, s), ψ).    (1)

While we formulate our method in pixel space, it is straightforward to adapt it to LDMs operating in latent space. When using LDMs with a shared latent space for teacher and student, the distillation loss can be computed in pixel or latent space. We compute the distillation loss in pixel space, as this yields more stable gradients when distilling latent diffusion models [72].
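To make the training procedure concrete, the sketch below outlines a single ADD-student optimization step in PyTorch. It is a minimal sketch, not the authors' implementation: the `student`, `adversarial_gen_loss`, and `distillation_loss` callables, the device handling, and the intermediate student timesteps in T_student (only τ_n = 1000 is specified in the paper) are assumptions; the overall structure (forward diffusion of real images, a single denoising forward pass, and the combined objective of Eq. (1) with λ = 2.5) follows the text.

```python
import torch

def add_training_step(student, adversarial_gen_loss, distillation_loss,
                      x0, text_emb, alphas, sigmas, lam=2.5):
    """One ADD-student update, mirroring Eq. (1): L = L_adv^G + lambda * L_distill.

    alphas, sigmas: 1-D tensors indexed by timestep; alpha[1000] = 0 corresponds to the
    zero-terminal-SNR requirement, so s = 1000 means the student sees pure noise.
    """
    # Student timesteps T_student = {tau_1, ..., tau_4}; the intermediate values
    # below are an assumption, only tau_n = 1000 is specified in the paper.
    t_student = torch.tensor([250, 500, 750, 1000])
    s = t_student[torch.randint(len(t_student), (x0.shape[0],))].to(x0.device)

    # Forward diffusion of real images: x_s = alpha_s * x0 + sigma_s * eps.
    eps = torch.randn_like(x0)
    a = alphas.to(x0.device)[s].view(-1, 1, 1, 1)
    sig = sigmas.to(x0.device)[s].view(-1, 1, 1, 1)
    x_s = a * x0 + sig * eps

    # Single student forward pass produces the denoised samples x_hat_theta(x_s, s).
    x_hat = student(x_s, s, text_emb)

    # Combined objective of Eq. (1); x0 is passed so the discriminator can be
    # conditioned on the image it was diffused from (Sec. 3.2).
    loss = adversarial_gen_loss(x_hat, x0, text_emb) + lam * distillation_loss(x_hat, text_emb)
    return loss, x_hat
```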
3.2. Adversarial Loss

For the discriminator, we follow the design and training procedure proposed in [59], which we briefly summarize; for details, we refer the reader to the original work. We use a frozen pretrained feature network F and a set of trainable lightweight discriminator heads D_{ϕ,k}. For the feature network F, Sauer et al. [59] find vision transformers (ViTs) [9] to work well, and we ablate different choices for the ViT objective and model size in Section 4. The trainable discriminator heads are applied to features F_k at different layers of the feature network.

To improve performance, the discriminator can be conditioned on additional information via projection [46]. Commonly, a text embedding c_text is used in the text-to-image setting. But, in contrast to standard GAN training, our training configuration also allows conditioning on a given image: for τ < 1000, the ADD-student receives some signal from the input image x_0. Therefore, for a given generated sample x̂_θ(x_s, s), we can condition the discriminator on information from x_0, which encourages the ADD-student to utilize its input effectively. In practice, we use an additional feature network to extract an image embedding c_img.
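To illustrate the conditioning mechanism described above, the following sketch shows one possible projection-conditioned discriminator head in the spirit of [46]: a lightweight head on top of a (pooled) frozen ViT feature F_k whose logit is shifted by inner products with projected text and image embeddings. The layer sizes, the pooling of ViT tokens, and the simple sum of the two projection terms are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ProjectedDiscriminatorHead(nn.Module):
    """Lightweight head D_{phi,k} on a frozen ViT feature F_k, with projection
    conditioning on a text embedding c_text and an image embedding c_img.
    The feature is assumed to be pooled to a single vector per image."""

    def __init__(self, feat_dim, cond_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.2))
        self.out = nn.Linear(hidden, 1)               # unconditional logit
        self.proj_text = nn.Linear(cond_dim, hidden)  # projection for c_text
        self.proj_img = nn.Linear(cond_dim, hidden)   # projection for c_img

    def forward(self, feat_k, c_text, c_img):
        h = self.trunk(feat_k)                        # (B, hidden)
        logit = self.out(h).squeeze(-1)               # (B,)
        # Projection conditioning: inner products between the projected
        # conditioning embeddings and the intermediate feature h.
        logit = logit + (self.proj_text(c_text) * h).sum(dim=-1)
        logit = logit + (self.proj_img(c_img) * h).sum(dim=-1)
        return logit
```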
Following [57, 59], we use the hinge loss [32] as the adversarial objective function. The ADD-student's adversarial objective L^G_adv(x̂_θ(x_s, s), ϕ) thus amounts to

    L^G_adv(x̂_θ(x_s, s), ϕ) = −E_{s,ϵ,x_0}[ Σ_k D_{ϕ,k}(F_k(x̂_θ(x_s, s))) ],    (2)

whereas the discriminator is trained to minimize

    L^D_adv(x̂_θ(x_s, s), ϕ) = E_{x_0}[ Σ_k max(0, 1 − D_{ϕ,k}(F_k(x_0))) + γ R1(ϕ) ] + E_{x̂_θ}[ Σ_k max(0, 1 + D_{ϕ,k}(F_k(x̂_θ))) ],    (3)

where R1 denotes the R1 gradient penalty [44]. Rather than computing the gradient penalty with respect to the pixel values, we compute it on the input of each discriminator head D_{ϕ,k}. We find that the R1 penalty is particularly beneficial when training at output resolutions larger than 128² px.
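A minimal sketch of the hinge objectives in Eqs. (2) and (3), including the R1 penalty computed on the inputs of the discriminator heads rather than on pixel values. `heads` stands for the trainable D_{ϕ,k} and `feats_*` for the frozen ViT features F_k; the per-head loop and the mean reductions are simplifications of this sketch. The default γ = 10⁻⁵ matches the value reported in Section 4.

```python
import torch

def generator_hinge_loss(heads, feats_fake, c_text, c_img):
    # Eq. (2): the ADD-student tries to push all head outputs up.
    return -sum(head(f, c_text, c_img).mean() for head, f in zip(heads, feats_fake))

def discriminator_hinge_loss(heads, feats_real, feats_fake, c_text, c_img, gamma=1e-5):
    loss = 0.0
    for head, f_real, f_fake in zip(heads, feats_real, feats_fake):
        f_real = f_real.detach().requires_grad_(True)  # R1 w.r.t. the head input F_k(x0)
        d_real = head(f_real, c_text, c_img)
        d_fake = head(f_fake.detach(), c_text, c_img)
        # Eq. (3): hinge terms for real and generated samples.
        loss = loss + torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
        # R1 gradient penalty computed on the input of each discriminator head.
        grad, = torch.autograd.grad(d_real.sum(), f_real, create_graph=True)
        loss = loss + gamma * grad.pow(2).sum(dim=-1).mean()
    return loss
```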
3.3. Score Distillation Loss

The distillation loss in Eq. (1) is formulated as

    L_distill(x̂_θ(x_s, s), ψ) = E_{t,ϵ′}[ c(t) d(x̂_θ, x̂_ψ(sg(x̂_θ,t); t)) ],    (4)

where sg denotes the stop-gradient operation. Intuitively, the loss uses a distance metric d to measure the mismatch between the samples x̂_θ generated by the ADD-student and the DM teacher's outputs x̂_ψ(x̂_θ,t, t) = (x̂_θ,t − σ_t ϵ̂_ψ(x̂_θ,t, t))/α_t, averaged over timesteps t and noise ϵ′. Notably, the teacher is not applied directly to generations x̂_θ of the ADD-student but instead to diffused outputs x̂_θ,t = α_t x̂_θ + σ_t ϵ′, as non-diffused inputs would be out-of-distribution for the teacher model [68].

In the following, we define the distance function d(x, y) := ||x − y||₂². Regarding the weighting function c(t), we consider two options: exponential weighting, where c(t) = α_t (higher noise levels contribute less), and score distillation sampling (SDS) weighting [51]. In the supplementary material, we demonstrate that with d(x, y) = ||x − y||₂² and a specific choice for c(t), our distillation loss becomes equivalent to the SDS objective L_SDS, as proposed in [51]. The advantage of our formulation is that it enables direct visualization of the reconstruction targets and that it naturally facilitates the execution of several consecutive denoising steps. Lastly, we also evaluate the noise-free score distillation (NFSD) objective, a recently proposed variant of SDS [28].
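The sketch below implements Eq. (4) with d(x, y) = ||x − y||₂² and the exponential weighting c(t) = α_t. The stop-gradient is realized by detaching the teacher's input, and the teacher is assumed to expose an ϵ-prediction interface; the uniform sampling of t and the mean reduction are assumptions of this sketch.

```python
import torch

def distillation_loss(x_hat, teacher_eps, alphas, sigmas, text_emb, num_train_timesteps=1000):
    """L_distill of Eq. (4) with d(x, y) = ||x - y||_2^2 and c(t) = alpha_t."""
    b = x_hat.shape[0]
    t = torch.randint(1, num_train_timesteps, (b,), device=x_hat.device)
    a = alphas.to(x_hat.device)[t].view(-1, 1, 1, 1)
    sig = sigmas.to(x_hat.device)[t].view(-1, 1, 1, 1)

    # Diffuse the student output with the teacher's forward process.
    eps_prime = torch.randn_like(x_hat)
    x_hat_t = a * x_hat + sig * eps_prime

    # Stop-gradient: the frozen teacher only sees a detached copy of x_hat_t.
    with torch.no_grad():
        eps_pred = teacher_eps(x_hat_t.detach(), t, text_emb)
        x_psi = (x_hat_t.detach() - sig * eps_pred) / a  # teacher's denoised target

    c = a  # exponential weighting c(t) = alpha_t: higher noise levels contribute less
    return (c * (x_hat - x_psi) ** 2).mean()
```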
Figure 5. User preference study (single step). We compare the performance of ADD-XL (1 step) against established baselines. ADD-XL outperforms all models except SDXL in human preference for both image quality and prompt alignment. Using more sampling steps further improves our model (bottom row).

4. Experiments

For our experiments, we train two models of different capacities, ADD-M (860M parameters) and ADD-XL (3.1B parameters). For ablating ADD-M, we use a Stable Diffusion (SD) 2.1 backbone [54], and for fair comparisons with other baselines, we use SD1.5. ADD-XL utilizes an SDXL [50] backbone. All experiments are conducted at a standardized resolution of 512×512 pixels; outputs from models generating higher resolutions are down-sampled to this size.

We employ a distillation weighting factor of λ = 2.5 across all experiments. Additionally, the R1 penalty strength γ is set to 10⁻⁵.
For discriminator conditioning, we use a pretrained CLIP-ViT-g-14 text encoder [52] to compute the text embeddings c_text and the CLS embedding of a DINOv2 ViT-L encoder [47] for the image embeddings c_img. For the baselines, we use the best publicly available models: latent diffusion models [50, 54] (SD1.5¹, SDXL²), cascaded pixel diffusion models [55] (IF-XL³), distilled diffusion models [39, 41] (LCM-1.5, LCM-1.5-XL⁴), and OpenMUSE⁵ [48], a reimplementation of MUSE [6], a transformer model specifically developed for fast inference. Note that we compare to the SDXL-Base-1.0 model without its additional refiner model; this is to ensure a fair comparison. As there are no public state-of-the-art GAN models, we retrain StyleGAN-T [59] with our improved discriminator. This baseline (StyleGAN-T++) significantly outperforms the previous best GANs in FID and CS; see the supplementary material. We quantify sample quality via FID [18] and text alignment via CLIP score [17]. For the CLIP score, we use a ViT-g-14 model trained on LAION-2B [61]. Both metrics are evaluated on 5k samples from COCO2017 [34].
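For reference, zero-shot FID and CLIP score can be computed with off-the-shelf metric implementations as sketched below. This is not the authors' evaluation code: it assumes torchmetrics' FrechetInceptionDistance and CLIPScore, and it substitutes a standard OpenAI CLIP checkpoint for the ViT-g-14/LAION-2B model used in the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate(real_images, fake_images, captions):
    """real_images, fake_images: uint8 tensors of shape (N, 3, H, W) in [0, 255];
    captions: list of N prompt strings (e.g. the 5k COCO2017 captions)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)

    # The paper uses an open_clip ViT-g-14 model trained on LAION-2B; a standard
    # CLIP checkpoint is used here purely for illustration.
    clip = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")
    clip.update(fake_images, captions)

    return fid.compute().item(), clip.compute().item()
```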
4.1. Ablation Study

Our training setup opens up a number of design spaces regarding the adversarial loss, distillation loss, initialization, and loss interplay. We conduct an ablation study on several choices in Table 1; key insights are highlighted below each table. We will discuss each experiment in the following.

Table 1. ADD ablation study. We report COCO zero-shot FID5k (FID) and CLIP score (CS). The results are calculated for a single student step. The default training settings are: DINOv2 ViT-S as the feature network, text and image conditioning for the discriminator, pretrained student weights, a small teacher and student model, and a single teacher step. The training length is 4000 iterations with a batch size of 128. Default settings are marked in gray.

(a) Discriminator feature networks. Small, modern DINO networks perform best.
Arch    Objective   FID ↓   CS ↑
ViT-S   DINOv1      21.5    0.312
ViT-S   DINOv2      20.6    0.319
ViT-L   DINOv2      24.0    0.302
ViT-L   CLIP        23.3    0.308

(b) Discriminator conditioning. Combining image and text conditioning is most effective.
c_text   c_img   FID ↓   CS ↑
✗        ✗       21.2    0.302
✓        ✗       21.2    0.307
✗        ✓       21.1    0.316
✓        ✓       20.6    0.319

(c) Student pretraining. A randomly initialized student network collapses.
Initialization   FID ↓   CS ↑
Random           293.6   0.065
Pretrained       20.6    0.319

(d) Loss terms. Both losses are needed, and exponential weighting of L_distill is beneficial.
Loss                        FID ↓   CS ↑
L_adv                       20.8    0.315
L_distill                   315.6   0.076
L_adv + λ L_distill,exp     20.6    0.319
L_adv + λ L_distill,sds     22.3    0.325
L_adv + λ L_distill,nfsd    21.8    0.327

(e) Teacher type. The student adopts the teacher's traits (SDXL has higher FID & CS).
Student   Teacher   FID ↓   CS ↑
SD2.1     SD2.1     20.6    0.319
SD2.1     SDXL      21.3    0.321
SDXL      SD2.1     29.3    0.314
SDXL      SDXL      28.41   0.325

(f) Teacher steps. A single teacher step is sufficient.
Steps   FID ↓   CS ↑
1       20.6    0.319
2       20.8    0.321
4       20.3    0.317

Discriminator feature networks. (Table 1a). Recent insights by Stein et al. [67] suggest that ViTs trained with the CLIP [52] or DINO [5, 47] objectives are particularly well-suited for evaluating the performance of generative models. Similarly, these models also seem effective as discriminator feature networks, with DINOv2 emerging as the best choice.

Discriminator conditioning. (Table 1b). Similar to prior studies, we observe that text conditioning of the discriminator enhances results. Notably, image conditioning outperforms text conditioning, and the combination of both c_text and c_img yields the best results.

Student pretraining. (Table 1c). Our experiments demonstrate the importance of pretraining the ADD-student. Being able to use pretrained generators is a significant advantage over pure GAN approaches. A problem of GANs is the lack of scalability; both Sauer et al. [59] and Kang et al. [25] observe a saturation of performance after a certain network capacity is reached. This observation contrasts with the generally smooth scaling laws of DMs [49]. However, ADD can effectively leverage larger pretrained DMs (see Table 1c) and benefit from stable DM pretraining.

Loss terms. (Table 1d). We find that both losses are essential. The distillation loss on its own is not effective, but when combined with the adversarial loss, there is a noticeable improvement in results. Different weighting schedules lead to different behaviours: the exponential schedule tends to yield more diverse samples, as indicated by lower FID, while the SDS and NFSD schedules improve quality and text alignment. While we use the exponential schedule as the default setting in all other ablations, we opt for the NFSD weighting for training our final model. Choosing an optimal weighting function presents an opportunity for improvement. Alternatively, scheduling the distillation weights over training, as explored in the 3D generative modeling literature [23], could be considered.

1 https://github.com/CompVis/stable-diffusion
2 https://github.com/Stability-AI/generative-models
3 https://github.com/deep-floyd/IF
4 https://huggingface.co/latent-consistency/lcm-lora-sdxl
5 https://huggingface.co/openMUSE
Figure 6. User preference study (multiple steps). We compare the performance of ADD-XL (4 steps) against established baselines. Our ADD-XL model outperforms all models, including its teacher SDXL 1.0 (base, no refiner) [50], in human preference for both image quality and prompt alignment.
Prompts: "A cinematic shot of a little pig priest wearing sunglasses." / "A photograph of the inside of a subway train. There are frogs sitting on the seats. One of them is reading a newspaper. The window shows the river in the background."

Figure 8. Qualitative comparison to the teacher model. ADD-XL (4 steps) can outperform its teacher model SDXL-Base (50 steps) in the multi-step setting. The adversarial loss boosts realism, particularly enhancing textures (fur, fabric, skin) while reducing the oversmoothing commonly observed in diffusion model samples. ADD-XL's overall sample diversity tends to be lower.

Further comparisons are provided in the supplementary material. Fig. 3 compares ADD-XL (1 step) against the best current baselines in the few-step regime. Fig. 4 illustrates the iterative sampling process of ADD-XL. These results showcase our model's ability to improve upon an initial sample. Such iterative improvement represents another significant benefit over pure GAN approaches like StyleGAN-T++. Lastly, Fig. 8 compares ADD-XL directly with its teacher model SDXL-Base. As indicated by the user studies in Section 4.2, ADD-XL outperforms its teacher in both quality and prompt alignment. The enhanced realism comes at the cost of slightly decreased sample diversity.

5. Discussion

This work introduces Adversarial Diffusion Distillation, a general method for distilling a pretrained diffusion model into a fast, few-step image generation model. We combine an adversarial and a score distillation objective to distill the public Stable Diffusion [54] and SDXL [50] models, leveraging both real data through the discriminator and structural understanding through the diffusion teacher. Our approach performs particularly well in the ultra-fast sampling regime of one or two steps, and our analyses demonstrate that it outperforms all concurrent methods in this regime. Furthermore, we retain the ability to refine samples using multiple steps: in fact, using four sampling steps, our model outperforms widely used multi-step generators such as SDXL, IF, and OpenMUSE. Our model enables the generation of high-quality images in a single step, opening up new possibilities for real-time generation with foundation models.

Acknowledgements

We would like to thank Jonas Müller for feedback on the draft, the proof, and typesetting; Patrick Esser for feedback on the proof and building an early model demo; Frederic Boesel for generating data and helpful discussions; Minguk Kang and Taesung Park for providing GigaGAN samples; Richard Vencu, Harry Saini, and Sami Kama for maintaining the compute infrastructure; Yara Wald for creative sampling support; and Vanessa Sauer for her general support.

References

[1] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Cather-
ine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam and content-guided video synthesis with diffusion models,
McCandlish, Chris Olah, and Jared Kaplan. A general lan- 2023. 1
guage assistant as a laboratory for alignment, 2021. 13 [13] Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos,
[2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen,
Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, and Alain Rakotomamonjy. Unifying gans and score-based
Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Ka- diffusion as generative particle models. arXiv preprint
davath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nel- arXiv:2305.16150, 2023. 3
son Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan [14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville,
Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack and Yoshua Bengio. Generative adversarial networks. Com-
Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared munications of the ACM, 63:139 – 144, 2014. 1, 2, 4
Kaplan. Training a helpful and harmless assistant with rein- [15] Timofey Grigoryev, Andrey Voynov, and Artem Babenko.
forcement learning from human feedback, 2022. 13 When, why, and which pretrained gans are useful? ICLR,
[3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, 2022. 4
Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, [16] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta de-
Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and noising score. In Proceedings of the IEEE/CVF International
Ming-Yu Liu. ediff-i: Text-to-image diffusion models with an Conference on Computer Vision, pages 2328–2337, 2023. 3
ensemble of expert denoisers. ArXiv, abs/2211.01324, 2022. [17] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras,
1, 2 and Yejin Choi. CLIPScore: A reference-free evaluation
[4] A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, metric for image captioning. In Proc. EMNLP, 2021. 6
Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your [18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern-
latents: High-resolution video synthesis with latent diffusion hard Nessler, and Sepp Hochreiter. GANs trained by a two
models. 2023 IEEE/CVF Conference on Computer Vision time-scale update rule converge to a local Nash equilibrium.
and Pattern Recognition (CVPR), pages 22563–22575, 2023. NeurIPS, 2017. 6, 12
1, 2 [19] Jonathan Ho. Classifier-free diffusion guidance. ArXiv,
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, abs/2207.12598, 2022. 2
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- [20] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion
ing properties in self-supervised vision transformers. In Pro- probabilistic models. ArXiv, abs/2006.11239, 2020. 1
ceedings of the IEEE/CVF international conference on com- [21] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang,
puter vision, pages 9650–9660, 2021. 6 Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben
[6] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans.
Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, Imagen video: High definition video generation with diffusion
William T Freeman, Michael Rubinstein, et al. Muse: Text-to- models. ArXiv, abs/2210.02303, 2022. 1, 2
image generation via masked generative transformers. Proc. [22] J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu,
ICML, 2023. 6 Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank
[7] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang adaptation of large language models. ArXiv, abs/2106.09685,
Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiao- 2021. 2
fang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image [23] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-
generation models using photogenic needles in a haystack. Jun Zha, and Lei Zhang. Dreamtime: An improved optimiza-
arXiv preprint arXiv:2309.15807, 2023. 4 tion strategy for text-to-3d content creation. arXiv preprint
[8] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Genie: arXiv:2306.12422, 2023. 6
Higher-order denoising diffusion solvers. Advances in Neural [24] Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Ta-
Information Processing Systems, 35:30150–30166, 2022. 2 chet des Combes, and Ioannis Mitliagkas. Adversarial score
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, matching and improved sampling for image generation. arXiv
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, preprint arXiv:2009.05475, 2020. 3
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- [25] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli
vain Gelly, et al. An image is worth 16x16 words: Trans- Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans
formers for image recognition at scale. arXiv preprint for text-to-image synthesis. In Proceedings of the IEEE/CVF
arXiv:2010.11929, 2020. 4 Conference on Computer Vision and Pattern Recognition,
[10] Arpad E. Elo. The Rating of Chessplayers, Past and Present. pages 10124–10134, 2023. 1, 2, 6, 14
Arco Pub., New York, 1978. 13 [26] Tero Karras, Samuli Laine, and Timo Aila. A style-based
[11] Patrick Esser, Robin Rombach, and Björn Ommer. Tam- generator architecture for generative adversarial networks.
ing transformers for high-resolution image synthesis. 2021 2019 IEEE/CVF Conference on Computer Vision and Pattern
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2018. 1, 4, 14
Recognition (CVPR), pages 12868–12878, 2020. 2 [27] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
[12] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jaakko Lehtinen, and Timo Aila. Analyzing and improving
Jonathan Granskog, and Anastasis Germanidis. Structure the image quality of stylegan. 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages proach for transferring knowledge from pre-trained diffusion
8107–8116, 2019. 1, 4 models. arXiv preprint arXiv:2305.18455, 2023. 3
[28] Oren Katzir, Or Patashnik, Daniel Cohen-Or, and Dani [43] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik
Lischinski. Noise-free score distillation. arXiv preprint Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans.
arXiv:2310.17590, 2023. 5 On distillation of guided diffusion models. In Proceedings of
[29] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- the IEEE/CVF Conference on Computer Vision and Pattern
rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Recognition, pages 14297–14306, 2023. 2, 7
Mitsufuji, and Stefano Ermon. Consistency trajectory models: [44] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin.
Learning probability flow ode trajectory of diffusion. arXiv Which training methods for gans do actually converge? In
preprint arXiv:2310.02279, 2023. 3 International conference on machine learning, pages 3481–
[30] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- 3490. PMLR, 2018. 5
tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset [45] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and
of user preferences for text-to-image generation, 2023. 12 Daniel Cohen-Or. Latent-nerf for shape-guided generation
[31] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun of 3d shapes and textures. 2023 IEEE/CVF Conference on
Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snapfusion: Computer Vision and Pattern Recognition (CVPR), pages
Text-to-image diffusion model on mobile devices within two 12663–12673, 2022. 2
seconds. arXiv preprint arXiv:2306.00980, 2023. 7 [46] Takeru Miyato and Masanori Koyama. cgans with projection
[32] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv discriminator. arXiv preprint arXiv:1802.05637, 2018. 4
preprint arXiv:1705.02894, 2017. 5 [47] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo,
[33] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Com- Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel
mon diffusion noise schedules and sample steps are flawed, Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2:
2023. 4 Learning robust visual features without supervision. arXiv
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bour- preprint arXiv:2304.07193, 2023. 6
dev, Ross Girshick, James Hays, Pietro Perona, Deva Ra- [48] Suraj Patil, William Berman, and Patrick von Platen. Amused:
manan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft An open muse model. https://github.com/huggingface/
coco: Common objects in context, 2015. 6 diffusers, 2023. 6, 15
[35] Xingchao Liu, Chengyue Gong, et al. Flow straight and [49] William Peebles and Saining Xie. Scalable diffusion models
fast: Learning to generate and transfer data with rectified with transformers. In Proceedings of the IEEE/CVF Inter-
flow. In The Eleventh International Conference on Learning national Conference on Computer Vision, pages 4195–4205,
Representations, 2022. 2 2023. 6
[36] Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and [50] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann,
Qiang Liu. Instaflow: One step is enough for high-quality Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.
diffusion-based text-to-image generation. arXiv preprint Sdxl: Improving latent diffusion models for high-resolution
arXiv:2309.06380, 2023. 2, 3, 7, 15 image synthesis. arXiv preprint arXiv:2307.01952, 2023. 2,
[37] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan 4, 5, 6, 7, 8, 12, 13
Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion [51] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall.
probabilistic model sampling in around 10 steps. Advances in Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint
Neural Information Processing Systems, 35:5775–5787, 2022. arXiv:2209.14988, 2022. 2, 5, 12
2, 7 [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
[38] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Hang Zhao. Latent consistency models: Synthesizing Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
high-resolution images with few-step inference. ArXiv, transferable visual models from natural language supervi-
abs/2310.04378, 2023. 2, 13 sion. In International conference on machine learning, pages
[39] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang 8748–8763. PMLR, 2021. 6, 12
Zhao. Latent consistency models: Synthesizing high- [53] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
resolution images with few-step inference. arXiv preprint and Mark Chen. Hierarchical text-conditional image gener-
arXiv:2310.04378, 2023. 6 ation with clip latents. ArXiv, abs/2204.06125, 2022. 1, 2,
[40] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von 4
Platen, Apolin’ario Passos, Longbo Huang, Jian Li, and Hang [54] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick
Zhao. Lcm-lora: A universal stable-diffusion acceleration Esser, and Björn Ommer. High-resolution image synthesis
module. ArXiv, abs/2311.05556, 2023. 2, 3, 13, 15 with latent diffusion models. 2022 IEEE/CVF Conference
[41] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von on Computer Vision and Pattern Recognition (CVPR), pages
Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang 10674–10685, 2021. 1, 2, 5, 6, 8
Zhao. Lcm-lora: A universal stable-diffusion acceleration [55] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li,
module. arXiv preprint arXiv:2311.05556, 2023. 6 Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael
[42] Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho-
Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal ap- torealistic text-to-image diffusion models with deep language
understanding. Advances in Neural Information Processing [70] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling
Systems, 35:36479–36494, 2022. 4, 6 the generative learning trilemma with denoising diffusion
[56] Tim Salimans and Jonathan Ho. Progressive distillation for gans. arXiv preprint arXiv:2112.07804, 2021. 3
fast sampling of diffusion models. CoRR, abs/2202.00512, [71] Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufo-
2022. 2 gen: You forward once large scale text-to-image generation
[57] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. via diffusion gans. arXiv preprint arXiv:2311.09257, 2023. 7
Projected gans converge faster. Advances in Neural Informa- [72] Chun-Han Yao, Amit Raj, Wei-Chih Hung, Yuanzhen Li,
tion Processing Systems, 34:17480–17492, 2021. 5 Michael Rubinstein, Ming-Hsuan Yang, and Varun Jampani.
[58] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Artic3d: Learning robust articulated 3d shapes from noisy
Scaling stylegan to large diverse datasets. ACM SIGGRAPH web image collections. arXiv preprint arXiv:2306.04619,
2022 Conference Proceedings, 2022. 1, 4 2023. 4
[59] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and [73] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan
Timo Aila. Stylegan-t: Unlocking the power of gans for fast Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei
large-scale text-to-image synthesis. Proc. ICML, 2023. 2, 3, Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana
4, 5, 6, 14 Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui
[60] Juergen Schmidhuber. Generative adversarial networks are Wu. Scaling autoregressive models for content-rich text-to-
special cases of artificial curiosity (1990) and also closely image generation, 2022. 12
related to predictability minimization (1991), 2020. 4 [74] Qinsheng Zhang and Yongxin Chen. Fast sampling of dif-
[61] Christoph Schuhmann, Romain Beaumont, Richard Vencu, fusion models with exponential integrator. arXiv preprint
Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, arXiv:2204.13902, 2022. 2
Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al.
LAION-5B: An open large-scale dataset for training next
generation image-text models. In NeurIPS, 2022. 6
[62] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual,
Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea
Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dy-
namic scene generation. arXiv preprint arXiv:2301.11280,
2023. 3
[63] Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Ma-
heswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. ArXiv,
abs/1503.03585, 2015. 1
[64] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising
diffusion implicit models. In International Conference on
Learning Representations, 2021. 2
[65] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P.
Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differen-
tial equations. ArXiv, abs/2011.13456, 2020. 1
[66] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
Consistency models. In International Conference on Machine
Learning, 2023. 2
[67] George Stein, Jesse C Cresswell, Rasa Hosseinzadeh, Yi Sui,
Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, An-
thony L Caterini, J Eric T Taylor, and Gabriel Loaiza-Ganem.
Exposing flaws of generative model evaluation metrics and
their unfair treatment of diffusion models. arXiv preprint
arXiv:2306.04675, 2023. 6, 14
[68] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh,
and Greg Shakhnarovich. Score jacobian chaining: Lifting
pretrained 2d diffusion models for 3d generation. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 12619–12629, 2023. 2, 5
[69] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan
Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and
diverse text-to-3d generation with variational score distilla-
tion. ArXiv, abs/2305.16213, 2023. 2
Appendix

A. SDS As a Special Case of the Distillation Loss

If we set the weighting function to c(t) = (α_t / 2σ_t) · w(t), where w(t) is the scaling factor from the weighted diffusion loss as in [51], and choose d(x, y) = ||x − y||₂², the distillation loss in Eq. (4) is equivalent to the score distillation objective:

    d/dθ L^MSE_distill
      = E_{t,ϵ′}[ c(t) · d/dθ ||x̂_θ − x̂_ψ(sg(x̂_θ,t); t)||₂² ]
      = E_{t,ϵ′}[ 2 c(t) [x̂_θ − x̂_ψ(x̂_θ,t; t)] · dx̂_θ/dθ ]
      = E_{t,ϵ′}[ (α_t/σ_t) w(t) [ (1/α_t)(x̂_θ,t − x̂_θ,t) + x̂_θ − x̂_ψ(x̂_θ,t; t) ] · dx̂_θ/dθ ]    (5)
      = E_{t,ϵ′}[ (1/σ_t) w(t) [ (α_t x̂_θ − x̂_θ,t) − (α_t x̂_ψ(x̂_θ,t; t) − x̂_θ,t) ] · dx̂_θ/dθ ]
      = E_{t,ϵ′}[ (w(t)/σ_t) [ −σ_t ϵ′ + σ_t ϵ̂_ψ(x̂_θ,t; t) ] · dx̂_θ/dθ ]
      = d/dθ L_SDS
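The identity above can be checked numerically with autograd on a toy example: with c(t) = (α_t / 2σ_t) · w(t), the gradient of the weighted squared-error distillation loss matches the SDS gradient w(t)(ϵ̂_ψ − ϵ′) dx̂_θ/dθ. The linear student and teacher maps and the scalar schedule values below are arbitrary stand-ins chosen only to exercise the algebra.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

d = 8
theta = torch.randn(d, requires_grad=True)
W = torch.randn(d, d)                      # student output x_hat = W @ theta
A = torch.randn(d, d)                      # frozen "teacher" epsilon prediction: eps_hat = A @ x
alpha, sigma, w = 0.8, 0.6, 1.7            # arbitrary schedule values and SDS weight
eps = torch.randn(d)

x_hat = W @ theta                          # student sample (differentiable in theta)
x_t = alpha * x_hat + sigma * eps          # diffused student output
eps_hat = A @ x_t.detach()                 # teacher prediction on the stop-gradient input
x_psi = (x_t.detach() - sigma * eps_hat) / alpha   # teacher's denoised estimate

# Distillation loss of Eq. (4) with c(t) = alpha / (2 * sigma) * w(t).
c = alpha / (2 * sigma) * w
grad_distill, = torch.autograd.grad(c * (x_hat - x_psi).pow(2).sum(), theta, retain_graph=True)

# SDS gradient w * (eps_hat - eps)^T (d x_hat / d theta), via a linear surrogate loss.
grad_sds, = torch.autograd.grad((w * (eps_hat - eps) * x_hat).sum(), theta)

print(torch.allclose(grad_distill, grad_sds))   # True
```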
Figure 9. User preference study (single step). We compare the performance of ADD-M (1-step) against established baselines.
Figure 10. User preference study (multiple steps). We compare the performance of ADD-XL (4-step) against established baselines.
The expected scores E_1 and E_2 of two players with ELO ratings R_1 and R_2 are given by

    E_1 = 1 / (1 + 10^((R_2 − R_1)/400)),    (6)
    E_2 = 1 / (1 + 10^((R_1 − R_2)/400)).    (7)

After observing the result of the game, the ratings R_i are updated via the rule

    R′_i = R_i + K · (S_i − E_i),   i ∈ {1, 2},    (8)

where S_i indicates the outcome of the match for player i; in our case, S_i = 1 if player i wins and S_i = 0 if player i loses. The constant K can be seen as a weight putting emphasis on more recent games. We choose K = 1 and bootstrap the final ELO ranking for a given series of comparisons based on 1000 individual ELO ranking calculations with randomly shuffled order. Before comparing the models, we choose the start rating for every model as R_init = 1000.
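A small sketch of the bootstrapped ELO computation described above. Representing the pairwise comparisons as (winner, loser) tuples and averaging the per-run ratings over bootstrap repetitions are assumptions of this sketch; the update rule, K = 1, the start rating of 1000, and the 1000 randomly shuffled orderings follow the text.

```python
import random
from collections import defaultdict

def elo_ranking(matches, k=1.0, r_init=1000.0, n_bootstrap=1000, seed=0):
    """matches: list of (winner, loser) model-name pairs from the user study."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_bootstrap):
        order = matches[:]
        rng.shuffle(order)                    # randomly shuffled game order
        ratings = defaultdict(lambda: r_init)
        for winner, loser in order:
            r_w, r_l = ratings[winner], ratings[loser]
            e_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))   # Eq. (6)
            e_l = 1.0 - e_w                                   # equivalent to Eq. (7)
            ratings[winner] = r_w + k * (1.0 - e_w)           # Eq. (8) with S = 1
            ratings[loser] = r_l + k * (0.0 - e_l)            # Eq. (8) with S = 0
        for name, r in ratings.items():
            totals[name] += r / n_bootstrap
    return dict(totals)

# Example: elo_ranking([("ADD-XL", "LCM-XL"), ("ADD-XL", "InstaFlow"), ("SDXL", "ADD-XL")])
```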
6 https://huggingface.co/openMUSE
7 https://github.com/deep-floyd/IF
8 https://huggingface.co/latent-consistency/lcm-lora-sdxl
9 https://app.prolific.com
C. GAN Baselines Comparison
For training our state-of-the-art GAN baseline StyleGAN-T++, we follow the training procedure outlined in [59]. The main
differences are extended training (∼2M iterations with a batch size of 2048, which is comparable to GigaGAN’s schedule [25]),
the improved discriminator architecture proposed in Section 3.2, and the R1 penalty applied at each discriminator head.
Fig. 11 shows that StyleGAN-T++ outperforms the previous best GANs by achieving a comparable zero-shot FID to
GigaGAN at a significantly higher CLIP score. Here, we do not compare to DMs, as comparisons between model classes via
automatic metrics tend to be less informative [67]. As an example, GigaGAN achieves FID and CLIP scores comparable to
SD1.5, but its sample quality is still inferior, as noted by the authors.
Figure 11. Comparing text alignment tradeoffs at 256 × 256 pixels. We compare FID–CLIP score curves (zero-shot FID5k vs. CLIP score, ViT-g-14) of StyleGAN-T, StyleGAN-T++, and GigaGAN. For increasing CLIP score, all methods use decreasing truncation [26] for values ψ = {1.0, 0.9, . . . , 0.3}.
Figure 12. Additional single-step 512² images generated with ADD-XL. All samples are generated with a single U-Net evaluation trained with adversarial diffusion distillation (ADD).
D. Additional Samples

We show additional one-step samples, as in Figure 1, in Figure 12. An additional qualitative comparison in the style of Figure 4, demonstrating that our model can further refine quality by using more than one sampling step, is provided in Figure 14: while sampling quality with a single step is already high, more steps can yield higher diversity and better spelling capabilities. Lastly, we provide an additional qualitative comparison of ADD-XL to other state-of-the-art one- and few-step models in Figure 13.
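For completeness, the following sketch shows one plausible multi-step sampling loop that is consistent with the description in Sec. 3.1: sampling starts from pure noise at s = 1000, and the student's prediction is re-diffused to the next (lower) student timestep before the next evaluation. The concrete timestep schedule and the re-noising rule are assumptions rather than the authors' exact inference procedure; fixing the seed keeps the overall layout stable across step counts, as observed in the columns of Figure 14.

```python
import torch

@torch.no_grad()
def sample_add(student, text_emb, alphas, sigmas, steps=(1000, 750, 500, 250),
               shape=(1, 3, 512, 512), seed=0):
    """Iterative ADD sampling sketch: denoise, then re-diffuse to the next timestep."""
    g = torch.Generator().manual_seed(seed)      # fixed seed -> stable layout across step counts
    x = torch.randn(shape, generator=g)          # pure noise; valid input since alpha_1000 = 0
    for i, s in enumerate(steps):
        s_batch = torch.full((shape[0],), s, dtype=torch.long)
        x_hat = student(x, s_batch, text_emb)    # single U-Net evaluation -> denoised estimate
        if i + 1 < len(steps):                   # re-noise to the next (lower) student timestep
            s_next = steps[i + 1]
            eps = torch.randn(shape, generator=g)
            x = alphas[s_next] * x_hat + sigmas[s_next] * eps
    return x_hat
```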
Figure 13. Additional qualitative comparisons to state-of-the-art fast samplers. Few-step samples from our ADD-XL and LCM-XL [40], InstaFlow [36], and OpenMUSE [48].
Prompts: "a robot is playing the guitar at a rock concert in front of a large crowd." / "A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!"

Figure 14. Additional results on the qualitative effect of sampling steps. Similar to Figure 4, we show qualitative examples when sampling ADD-XL with 1, 2, and 4 steps. Single-step samples are often already of high quality, but increasing the number of steps can further improve the diversity (left) and spelling capabilities (right). The seeds are constant within columns and we see that the general layout is preserved across sampling steps, allowing for fast exploration of outputs while retaining the possibility to refine.