Adversarial Diffusion Distillation
Figure 1. Generating high-fidelity 512² images in a single step. All samples are generated with a single U-Net evaluation trained with adversarial diffusion distillation (ADD).
1. Introduction
Our approach is conceptually simple: We propose Adversarial Diffusion Distillation (ADD), a general approach that reduces the number of inference steps of a pre-trained diffusion model to 1–4 sampling steps while maintaining high sampling fidelity and potentially further improving the overall performance of the model. To this end, we introduce a combination of two training objectives: (i) an adversarial loss and (ii) a distillation loss that corresponds to score distillation sampling (SDS) [51]. The adversarial loss forces the model to directly generate samples that lie on the manifold of real images at each forward pass, avoiding blurriness and other artifacts typically observed in other distillation methods [43]. The distillation loss uses another pretrained (and fixed) DM as a teacher to effectively utilize the extensive knowledge of the pretrained DM and preserve the strong compositionality observed in large DMs. During inference, our approach does not use classifier-free guidance [19], further reducing memory requirements.
We retain the model's ability to improve results through iterative refinement, which is an advantage over previous one-step GAN-based approaches [59].

Our contributions can be summarized as follows:
• We introduce ADD, a method for turning pretrained diffusion models into high-fidelity, real-time image generators using only 1–4 sampling steps.
• Our method uses a novel combination of adversarial training and score distillation, for which we carefully ablate several design choices.
• ADD significantly outperforms strong baselines such as LCM, LCM-XL [38] and single-step GANs [59], and is able to handle complex image compositions while maintaining high image realism at only a single inference step.
• Using four sampling steps, ADD-XL outperforms its teacher model SDXL-Base at a resolution of 512² px.

Figure 2. Adversarial Diffusion Distillation. The ADD-student is trained as a denoiser that receives diffused input images x_s and outputs samples x̂_θ(x_s, s), optimizing two objectives: a) adversarial loss: the model aims to fool a discriminator which is trained to distinguish the generated samples x̂_θ from real images x_0; b) distillation loss: the model is trained to match the denoised targets x̂_ψ of a frozen DM teacher.

2. Background

While diffusion models achieve remarkable performance in synthesizing and editing high-resolution images [3, 53, 54] and videos [4, 21], their iterative nature hinders real-time application. Latent diffusion models [54] attempt to solve this problem by representing images in a more computationally feasible latent space [11], but they still rely on the iterative application of large models with billions of parameters. In addition to utilizing faster samplers for diffusion models [8, 37, 64, 74], there is a growing body of research on model distillation, such as progressive distillation [56] and guidance distillation [43]. These approaches reduce the number of iterative sampling steps to 4–8, but may significantly lower the original performance. Furthermore, they require an iterative training process. Consistency models [66] address the latter issue by enforcing a consistency regularization on the ODE trajectory and demonstrate strong performance for pixel-based models in the few-shot setting. LCMs [38] focus on distilling latent diffusion models and achieve impressive performance at 4 sampling steps. Recently, LCM-LoRA [40] introduced low-rank adaptation [22] training for efficiently learning LCM modules, which can be plugged into different checkpoints for SD and SDXL [50, 54]. InstaFlow [36] proposes to use Rectified Flows [35] to facilitate a better distillation process.

All of these methods share common flaws: samples synthesized in four steps often look blurry and exhibit noticeable artifacts. At fewer sampling steps, this problem is further amplified. GANs [14] can also be trained as standalone single-step models for text-to-image synthesis [25, 59]. Their sampling speed is impressive, yet their performance lags behind diffusion-based models. In part, this can be attributed to the finely balanced GAN-specific architectures necessary for stable training of the adversarial objective. Scaling these models and integrating advances in neural network architectures without disturbing that balance is notoriously challenging. Additionally, current state-of-the-art text-to-image GANs do not have a mechanism like classifier-free guidance available, which is crucial for DMs at scale.
Prompts: "A cinematic shot of a professor sloth wearing a tuxedo at a BBQ party." / "A high-quality photo of a confused bear in calculus class. The bear is wearing a party hat and steampunk armor."

Figure 3. Qualitative comparison to state-of-the-art fast samplers. Single-step samples from our ADD-XL (top) and LCM-XL [40], our custom StyleGAN-T [59] baseline, InstaFlow [36], and MUSE. For MUSE, we use the OpenMUSE implementation and default inference settings with 16 sampling steps. For LCM-XL, we sample with 1, 2, and 4 steps. Our model outperforms all other few-step samplers in a single step.

Score Distillation Sampling [51], also known as Score Jacobian Chaining [68], is a recently proposed method developed to distill the knowledge of foundational T2I models into 3D synthesis models. While the majority of SDS-based works [45, 51, 68, 69] use SDS in the context of per-scene optimization for 3D objects, the approach has also been applied to text-to-3D-video synthesis [62] and in the context of image editing [16].

Recently, the authors of [13] have shown a strong relationship between score-based models and GANs and propose Score GANs, which are trained using score-based diffusion flows from a DM instead of a discriminator. Similarly, Diff-Instruct [42], a method which generalizes SDS, enables distilling a pretrained diffusion model into a generator without a discriminator.

Conversely, there are also approaches which aim to improve the diffusion process using adversarial training. For faster sampling, Denoising Diffusion GANs [70] are introduced as a method to enable sampling with few steps. To improve quality, a discriminator loss is added to the score matching objective in Adversarial Score Matching [24] and to the consistency objective of CTM [29].

Our method combines adversarial training and score distillation in a hybrid objective to address the issues in current top-performing few-step generative models.
Prompts: "A brain riding a rocketship heading towards the moon." / "A bald eagle made of chocolate powder, mango, and whipped cream" / "A blue colored dog."

Figure 4. Qualitative effect of sampling steps. We show qualitative examples when sampling ADD-XL with 1, 2, and 4 steps. Single-step samples are often already of high quality, but increasing the number of steps can further improve the consistency (e.g. second prompt, first column) and attention to detail (e.g. second prompt, second column). The seeds are constant within columns, and we see that the general layout is preserved across sampling steps, allowing for fast exploration of outputs while retaining the possibility to refine.

3. Method

Our goal is to generate high-fidelity samples in as few sampling steps as possible, while matching the quality of state-of-the-art models [7, 50, 53, 55]. The adversarial objective [14, 60] naturally lends itself to fast generation, as it trains a model that outputs samples on the image manifold in a single forward step. However, attempts at scaling GANs to large datasets [58, 59] observed that it is critical to not solely rely on the discriminator, but to also employ a pretrained classifier or CLIP network for improving text alignment. As remarked in [59], overly utilizing discriminative networks introduces artifacts and image quality suffers. Instead, we utilize the gradient of a pretrained diffusion model via a score distillation objective to improve text alignment and sample quality. Furthermore, instead of training from scratch, we initialize our model with pretrained diffusion model weights; pretraining the generator network is known to significantly improve training with an adversarial loss [15]. Lastly, instead of utilizing a decoder-only architecture as used for GAN training [26, 27], we adapt a standard diffusion model framework. This setup naturally enables iterative refinement.

3.1. Training Procedure

Our training procedure is outlined in Fig. 2 and involves three networks: the ADD-student, which is initialized from a pretrained UNet-DM with weights θ, a discriminator with trainable weights ϕ, and a DM teacher with frozen weights ψ. During training, the ADD-student generates samples x̂_θ(x_s, s) from noisy data x_s. The noised data points are produced from a dataset of real images x_0 via a forward diffusion process x_s = α_s x_0 + σ_s ϵ. In our experiments, we use the same coefficients α_s and σ_s as the student DM and sample s uniformly from a set T_student = {τ_1, ..., τ_n} of N chosen student timesteps. In practice, we choose N = 4. Importantly, we set τ_n = 1000 and enforce zero-terminal SNR [33] during training, as the model needs to start from pure noise during inference.

For the adversarial objective, the generated samples x̂_θ and real images x_0 are passed to the discriminator, which aims to distinguish between them. The design of the discriminator and the adversarial loss are described in detail in Sec. 3.2. To distill knowledge from the DM teacher, we diffuse student samples x̂_θ with the teacher's forward process to x̂_θ,t and use the teacher's denoising prediction x̂_ψ(x̂_θ,t, t) as a reconstruction target for the distillation loss L_distill; see Section 3.3. Thus, the overall objective is

    L = L^G_adv(x̂_θ(x_s, s), ϕ) + λ L_distill(x̂_θ(x_s, s), ψ).    (1)

While we formulate our method in pixel space, it is straightforward to adapt it to LDMs operating in latent space. When using LDMs with a shared latent space for teacher and student, the distillation loss can be computed in pixel or latent space. We compute the distillation loss in pixel space, as this yields more stable gradients when distilling latent diffusion models [72].
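To make the training procedure concrete, the sketch below outlines a single ADD-student optimization step in PyTorch. It is a minimal sketch, not the authors' implementation: the `student`, `adversarial_gen_loss`, and `distillation_loss` callables, the device handling, and the intermediate student timesteps in T_student (only τ_n = 1000 is specified in the paper) are assumptions; the overall structure (forward diffusion of real images, a single denoising forward pass, and the combined objective of Eq. (1) with λ = 2.5) follows the text.

```python
import torch

def add_training_step(student, adversarial_gen_loss, distillation_loss,
                      x0, text_emb, alphas, sigmas, lam=2.5):
    """One ADD-student update, mirroring Eq. (1): L = L_adv^G + lambda * L_distill.

    alphas, sigmas: 1-D tensors indexed by timestep; alpha[1000] = 0 corresponds to the
    zero-terminal-SNR requirement, so s = 1000 means the student sees pure noise.
    """
    # Student timesteps T_student = {tau_1, ..., tau_4}; the intermediate values
    # below are an assumption, only tau_n = 1000 is specified in the paper.
    t_student = torch.tensor([250, 500, 750, 1000])
    s = t_student[torch.randint(len(t_student), (x0.shape[0],))].to(x0.device)

    # Forward diffusion of real images: x_s = alpha_s * x0 + sigma_s * eps.
    eps = torch.randn_like(x0)
    a = alphas.to(x0.device)[s].view(-1, 1, 1, 1)
    sig = sigmas.to(x0.device)[s].view(-1, 1, 1, 1)
    x_s = a * x0 + sig * eps

    # Single student forward pass produces the denoised samples x_hat_theta(x_s, s).
    x_hat = student(x_s, s, text_emb)

    # Combined objective of Eq. (1); x0 is passed so the discriminator can be
    # conditioned on the image it was diffused from (Sec. 3.2).
    loss = adversarial_gen_loss(x_hat, x0, text_emb) + lam * distillation_loss(x_hat, text_emb)
    return loss, x_hat
```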
3.2. Adversarial Loss

For the discriminator, we follow the design and training procedure proposed in [59], which we briefly summarize; for details, we refer the reader to the original work. We use a frozen pretrained feature network F and a set of trainable lightweight discriminator heads D_{ϕ,k}. For the feature network F, Sauer et al. [59] find vision transformers (ViTs) [9] to work well, and we ablate different choices for the ViT objective and model size in Section 4. The trainable discriminator heads are applied to features F_k at different layers of the feature network.

To improve performance, the discriminator can be conditioned on additional information via projection [46]. Commonly, a text embedding c_text is used in the text-to-image setting. But, in contrast to standard GAN training, our training configuration also allows conditioning on a given image: for τ < 1000, the ADD-student receives some signal from the input image x_0. Therefore, for a given generated sample x̂_θ(x_s, s), we can condition the discriminator on information from x_0, which encourages the ADD-student to utilize its input effectively. In practice, we use an additional feature network to extract an image embedding c_img.
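To illustrate the conditioning mechanism described above, the following sketch shows one possible projection-conditioned discriminator head in the spirit of [46]: a lightweight head on top of a (pooled) frozen ViT feature F_k whose logit is shifted by inner products with projected text and image embeddings. The layer sizes, the pooling of ViT tokens, and the simple sum of the two projection terms are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ProjectedDiscriminatorHead(nn.Module):
    """Lightweight head D_{phi,k} on a frozen ViT feature F_k, with projection
    conditioning on a text embedding c_text and an image embedding c_img.
    The feature is assumed to be pooled to a single vector per image."""

    def __init__(self, feat_dim, cond_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.2))
        self.out = nn.Linear(hidden, 1)               # unconditional logit
        self.proj_text = nn.Linear(cond_dim, hidden)  # projection for c_text
        self.proj_img = nn.Linear(cond_dim, hidden)   # projection for c_img

    def forward(self, feat_k, c_text, c_img):
        h = self.trunk(feat_k)                        # (B, hidden)
        logit = self.out(h).squeeze(-1)               # (B,)
        # Projection conditioning: inner products between the projected
        # conditioning embeddings and the intermediate feature h.
        logit = logit + (self.proj_text(c_text) * h).sum(dim=-1)
        logit = logit + (self.proj_img(c_img) * h).sum(dim=-1)
        return logit
```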
Following [57, 59], we use the hinge loss [32] as the adversarial objective function. The ADD-student's adversarial objective L^G_adv(x̂_θ(x_s, s), ϕ) thus amounts to

    L^G_adv(x̂_θ(x_s, s), ϕ) = −E_{s,ϵ,x_0}[ Σ_k D_{ϕ,k}(F_k(x̂_θ(x_s, s))) ],    (2)

whereas the discriminator is trained to minimize

    L^D_adv(x̂_θ(x_s, s), ϕ) = E_{x_0}[ Σ_k max(0, 1 − D_{ϕ,k}(F_k(x_0))) + γ R1(ϕ) ] + E_{x̂_θ}[ Σ_k max(0, 1 + D_{ϕ,k}(F_k(x̂_θ))) ],    (3)

where R1 denotes the R1 gradient penalty [44]. Rather than computing the gradient penalty with respect to the pixel values, we compute it on the input of each discriminator head D_{ϕ,k}. We find that the R1 penalty is particularly beneficial when training at output resolutions larger than 128² px.
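A minimal sketch of the hinge objectives in Eqs. (2) and (3), including the R1 penalty computed on the inputs of the discriminator heads rather than on pixel values. `heads` stands for the trainable D_{ϕ,k} and `feats_*` for the frozen ViT features F_k; the per-head loop and the mean reductions are simplifications of this sketch. The default γ = 10⁻⁵ matches the value reported in Section 4.

```python
import torch

def generator_hinge_loss(heads, feats_fake, c_text, c_img):
    # Eq. (2): the ADD-student tries to push all head outputs up.
    return -sum(head(f, c_text, c_img).mean() for head, f in zip(heads, feats_fake))

def discriminator_hinge_loss(heads, feats_real, feats_fake, c_text, c_img, gamma=1e-5):
    loss = 0.0
    for head, f_real, f_fake in zip(heads, feats_real, feats_fake):
        f_real = f_real.detach().requires_grad_(True)  # R1 w.r.t. the head input F_k(x0)
        d_real = head(f_real, c_text, c_img)
        d_fake = head(f_fake.detach(), c_text, c_img)
        # Eq. (3): hinge terms for real and generated samples.
        loss = loss + torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()
        # R1 gradient penalty computed on the input of each discriminator head.
        grad, = torch.autograd.grad(d_real.sum(), f_real, create_graph=True)
        loss = loss + gamma * grad.pow(2).sum(dim=-1).mean()
    return loss
```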
3.3. Score Distillation Loss

The distillation loss in Eq. (1) is formulated as

    L_distill(x̂_θ(x_s, s), ψ) = E_{t,ϵ′}[ c(t) d(x̂_θ, x̂_ψ(sg(x̂_θ,t); t)) ],    (4)

where sg denotes the stop-gradient operation. Intuitively, the loss uses a distance metric d to measure the mismatch between the samples x̂_θ generated by the ADD-student and the DM teacher's outputs x̂_ψ(x̂_θ,t, t) = (x̂_θ,t − σ_t ϵ̂_ψ(x̂_θ,t, t))/α_t, averaged over timesteps t and noise ϵ′. Notably, the teacher is not applied directly to generations x̂_θ of the ADD-student but instead to diffused outputs x̂_θ,t = α_t x̂_θ + σ_t ϵ′, as non-diffused inputs would be out-of-distribution for the teacher model [68].

In the following, we define the distance function d(x, y) := ||x − y||₂². Regarding the weighting function c(t), we consider two options: exponential weighting, where c(t) = α_t (higher noise levels contribute less), and score distillation sampling (SDS) weighting [51]. In the supplementary material, we demonstrate that with d(x, y) = ||x − y||₂² and a specific choice for c(t), our distillation loss becomes equivalent to the SDS objective L_SDS, as proposed in [51]. The advantage of our formulation is that it enables direct visualization of the reconstruction targets and that it naturally facilitates the execution of several consecutive denoising steps. Lastly, we also evaluate the noise-free score distillation (NFSD) objective, a recently proposed variant of SDS [28].
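The sketch below implements Eq. (4) with d(x, y) = ||x − y||₂² and the exponential weighting c(t) = α_t. The stop-gradient is realized by detaching the teacher's input, and the teacher is assumed to expose an ϵ-prediction interface; the uniform sampling of t and the mean reduction are assumptions of this sketch.

```python
import torch

def distillation_loss(x_hat, teacher_eps, alphas, sigmas, text_emb, num_train_timesteps=1000):
    """L_distill of Eq. (4) with d(x, y) = ||x - y||_2^2 and c(t) = alpha_t."""
    b = x_hat.shape[0]
    t = torch.randint(1, num_train_timesteps, (b,), device=x_hat.device)
    a = alphas.to(x_hat.device)[t].view(-1, 1, 1, 1)
    sig = sigmas.to(x_hat.device)[t].view(-1, 1, 1, 1)

    # Diffuse the student output with the teacher's forward process.
    eps_prime = torch.randn_like(x_hat)
    x_hat_t = a * x_hat + sig * eps_prime

    # Stop-gradient: the frozen teacher only sees a detached copy of x_hat_t.
    with torch.no_grad():
        eps_pred = teacher_eps(x_hat_t.detach(), t, text_emb)
        x_psi = (x_hat_t.detach() - sig * eps_pred) / a  # teacher's denoised target

    c = a  # exponential weighting c(t) = alpha_t: higher noise levels contribute less
    return (c * (x_hat - x_psi) ** 2).mean()
```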
Figure 5. User preference study (single step). We compare the performance of ADD-XL (1 step) against established baselines. ADD-XL outperforms all models except SDXL in human preference for both image quality and prompt alignment. Using more sampling steps further improves our model (bottom row).

4. Experiments

For our experiments, we train two models of different capacities, ADD-M (860M parameters) and ADD-XL (3.1B parameters). For ablating ADD-M, we use a Stable Diffusion (SD) 2.1 backbone [54], and for fair comparisons with other baselines, we use SD1.5. ADD-XL utilizes an SDXL [50] backbone. All experiments are conducted at a standardized resolution of 512×512 pixels; outputs from models generating higher resolutions are down-sampled to this size.

We employ a distillation weighting factor of λ = 2.5 across all experiments. Additionally, the R1 penalty strength γ is set to 10⁻⁵.
For discriminator conditioning, we use a pretrained CLIP-ViT-g-14 text encoder [52] to compute the text embeddings c_text and the CLS embedding of a DINOv2 ViT-L encoder [47] for the image embeddings c_img. For the baselines, we use the best publicly available models: latent diffusion models [50, 54] (SD1.5¹, SDXL²), cascaded pixel diffusion models [55] (IF-XL³), distilled diffusion models [39, 41] (LCM-1.5, LCM-1.5-XL⁴), and OpenMUSE⁵ [48], a reimplementation of MUSE [6], a transformer model specifically developed for fast inference. Note that we compare to the SDXL-Base-1.0 model without its additional refiner model; this is to ensure a fair comparison. As there are no public state-of-the-art GAN models, we retrain StyleGAN-T [59] with our improved discriminator. This baseline (StyleGAN-T++) significantly outperforms the previous best GANs in FID and CS; see the supplementary material. We quantify sample quality via FID [18] and text alignment via CLIP score [17]. For the CLIP score, we use a ViT-g-14 model trained on LAION-2B [61]. Both metrics are evaluated on 5k samples from COCO2017 [34].
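For reference, zero-shot FID and CLIP score can be computed with off-the-shelf metric implementations as sketched below. This is not the authors' evaluation code: it assumes torchmetrics' FrechetInceptionDistance and CLIPScore, and it substitutes a standard OpenAI CLIP checkpoint for the ViT-g-14/LAION-2B model used in the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def evaluate(real_images, fake_images, captions):
    """real_images, fake_images: uint8 tensors of shape (N, 3, H, W) in [0, 255];
    captions: list of N prompt strings (e.g. the 5k COCO2017 captions)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)

    # The paper uses an open_clip ViT-g-14 model trained on LAION-2B; a standard
    # CLIP checkpoint is used here purely for illustration.
    clip = CLIPScore(model_name_or_path="openai/clip-vit-large-patch14")
    clip.update(fake_images, captions)

    return fid.compute().item(), clip.compute().item()
```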
4.1. Ablation Study

Our training setup opens up a number of design spaces regarding the adversarial loss, distillation loss, initialization, and loss interplay. We conduct an ablation study on several choices in Table 1; key insights are highlighted below each table. We will discuss each experiment in the following.

Table 1. ADD ablation study. We report COCO zero-shot FID5k (FID) and CLIP score (CS). The results are calculated for a single student step. The default training settings are: DINOv2 ViT-S as the feature network, text and image conditioning for the discriminator, pretrained student weights, a small teacher and student model, and a single teacher step. The training length is 4000 iterations with a batch size of 128. Default settings are marked in gray.

(a) Discriminator feature networks. Small, modern DINO networks perform best.
Arch    Objective   FID ↓   CS ↑
ViT-S   DINOv1      21.5    0.312
ViT-S   DINOv2      20.6    0.319
ViT-L   DINOv2      24.0    0.302
ViT-L   CLIP        23.3    0.308

(b) Discriminator conditioning. Combining image and text conditioning is most effective.
c_text   c_img   FID ↓   CS ↑
✗        ✗       21.2    0.302
✓        ✗       21.2    0.307
✗        ✓       21.1    0.316
✓        ✓       20.6    0.319

(c) Student pretraining. A randomly initialized student network collapses.
Initialization   FID ↓   CS ↑
Random           293.6   0.065
Pretrained       20.6    0.319

(d) Loss terms. Both losses are needed, and exponential weighting of L_distill is beneficial.
Loss                        FID ↓   CS ↑
L_adv                       20.8    0.315
L_distill                   315.6   0.076
L_adv + λ L_distill,exp     20.6    0.319
L_adv + λ L_distill,sds     22.3    0.325
L_adv + λ L_distill,nfsd    21.8    0.327

(e) Teacher type. The student adopts the teacher's traits (SDXL has higher FID & CS).
Student   Teacher   FID ↓   CS ↑
SD2.1     SD2.1     20.6    0.319
SD2.1     SDXL      21.3    0.321
SDXL      SD2.1     29.3    0.314
SDXL      SDXL      28.41   0.325

(f) Teacher steps. A single teacher step is sufficient.
Steps   FID ↓   CS ↑
1       20.6    0.319
2       20.8    0.321
4       20.3    0.317

Discriminator feature networks. (Table 1a). Recent insights by Stein et al. [67] suggest that ViTs trained with the CLIP [52] or DINO [5, 47] objectives are particularly well-suited for evaluating the performance of generative models. Similarly, these models also seem effective as discriminator feature networks, with DINOv2 emerging as the best choice.

Discriminator conditioning. (Table 1b). Similar to prior studies, we observe that text conditioning of the discriminator enhances results. Notably, image conditioning outperforms text conditioning, and the combination of both c_text and c_img yields the best results.

Student pretraining. (Table 1c). Our experiments demonstrate the importance of pretraining the ADD-student. Being able to use pretrained generators is a significant advantage over pure GAN approaches. A problem of GANs is the lack of scalability; both Sauer et al. [59] and Kang et al. [25] observe a saturation of performance after a certain network capacity is reached. This observation contrasts with the generally smooth scaling laws of DMs [49]. However, ADD can effectively leverage larger pretrained DMs (see Table 1c) and benefit from stable DM pretraining.

Loss terms. (Table 1d). We find that both losses are essential. The distillation loss on its own is not effective, but when combined with the adversarial loss, there is a noticeable improvement in results. Different weighting schedules lead to different behaviours: the exponential schedule tends to yield more diverse samples, as indicated by lower FID, while the SDS and NFSD schedules improve quality and text alignment. While we use the exponential schedule as the default setting in all other ablations, we opt for the NFSD weighting for training our final model. Choosing an optimal weighting function presents an opportunity for improvement. Alternatively, scheduling the distillation weights over training, as explored in the 3D generative modeling literature [23], could be considered.

1 https://github.com/CompVis/stable-diffusion
2 https://github.com/Stability-AI/generative-models
3 https://github.com/deep-floyd/IF
4 https://huggingface.co/latent-consistency/lcm-lora-sdxl
5 https://huggingface.co/openMUSE
Figure 6. User preference study (multiple steps). We compare the performance of ADD-XL (4 steps) against established baselines. Our ADD-XL model outperforms all models, including its teacher SDXL 1.0 (base, no refiner) [50], in human preference for both image quality and prompt alignment.
Prompts: "A cinematic shot of a little pig priest wearing sunglasses." / "A photograph of the inside of a subway train. There are frogs sitting on the seats. One of them is reading a newspaper. The window shows the river in the background."

Figure 8. Qualitative comparison to the teacher model. ADD-XL (4 steps) can outperform its teacher model SDXL-Base (50 steps) in the multi-step setting. The adversarial loss boosts realism, particularly enhancing textures (fur, fabric, skin) while reducing the oversmoothing commonly observed in diffusion model samples. ADD-XL's overall sample diversity tends to be lower.

Further comparisons are provided in the supplementary material. Fig. 3 compares ADD-XL (1 step) against the best current baselines in the few-step regime. Fig. 4 illustrates the iterative sampling process of ADD-XL. These results showcase our model's ability to improve upon an initial sample. Such iterative improvement represents another significant benefit over pure GAN approaches like StyleGAN-T++. Lastly, Fig. 8 compares ADD-XL directly with its teacher model SDXL-Base. As indicated by the user studies in Section 4.2, ADD-XL outperforms its teacher in both quality and prompt alignment. The enhanced realism comes at the cost of slightly decreased sample diversity.

5. Discussion

This work introduces Adversarial Diffusion Distillation, a general method for distilling a pretrained diffusion model into a fast, few-step image generation model. We combine an adversarial and a score distillation objective to distill the public Stable Diffusion [54] and SDXL [50] models, leveraging both real data through the discriminator and structural understanding through the diffusion teacher. Our approach performs particularly well in the ultra-fast sampling regime of one or two steps, and our analyses demonstrate that it outperforms all concurrent methods in this regime. Furthermore, we retain the ability to refine samples using multiple steps: in fact, using four sampling steps, our model outperforms widely used multi-step generators such as SDXL, IF, and OpenMUSE. Our model enables the generation of high-quality images in a single step, opening up new possibilities for real-time generation with foundation models.

Acknowledgements

We would like to thank Jonas Müller for feedback on the draft, the proof, and typesetting; Patrick Esser for feedback on the proof and building an early model demo; Frederic Boesel for generating data and helpful discussions; Minguk Kang and Taesung Park for providing GigaGAN samples; Richard Vencu, Harry Saini, and Sami Kama for maintaining the compute infrastructure; Yara Wald for creative sampling support; and Vanessa Sauer for her general support.

References

[1] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Cather-
ine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam and content-guided video synthesis with diffusion models,
McCandlish, Chris Olah, and Jared Kaplan. A general lan- 2023. 1
guage assistant as a laboratory for alignment, 2021. 13 [13] Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos,
[2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen,
Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, and Alain Rakotomamonjy. Unifying gans and score-based
Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Ka- diffusion as generative particle models. arXiv preprint
davath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nel- arXiv:2305.16150, 2023. 3
son Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan [14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville,
Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack and Yoshua Bengio. Generative adversarial networks. Com-
Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared munications of the ACM, 63:139 – 144, 2014. 1, 2, 4
Kaplan. Training a helpful and harmless assistant with rein- [15] Timofey Grigoryev, Andrey Voynov, and Artem Babenko.
forcement learning from human feedback, 2022. 13 When, why, and which pretrained gans are useful? ICLR,
[3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, 2022. 4
Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, [16] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta de-
Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and noising score. In Proceedings of the IEEE/CVF International
Ming-Yu Liu. ediff-i: Text-to-image diffusion models with an Conference on Computer Vision, pages 2328–2337, 2023. 3
ensemble of expert denoisers. ArXiv, abs/2211.01324, 2022. [17] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras,
1, 2 and Yejin Choi. CLIPScore: A reference-free evaluation
[4] A. Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, metric for image captioning. In Proc. EMNLP, 2021. 6
Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your [18] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern-
latents: High-resolution video synthesis with latent diffusion hard Nessler, and Sepp Hochreiter. GANs trained by a two
models. 2023 IEEE/CVF Conference on Computer Vision time-scale update rule converge to a local Nash equilibrium.
and Pattern Recognition (CVPR), pages 22563–22575, 2023. NeurIPS, 2017. 6, 12
1, 2 [19] Jonathan Ho. Classifier-free diffusion guidance. ArXiv,
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, abs/2207.12598, 2022. 2
Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- [20] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion
ing properties in self-supervised vision transformers. In Pro- probabilistic models. ArXiv, abs/2006.11239, 2020. 1
ceedings of the IEEE/CVF international conference on com- [21] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang,
puter vision, pages 9650–9660, 2021. 6 Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben
[6] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans.
Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, Imagen video: High definition video generation with diffusion
William T Freeman, Michael Rubinstein, et al. Muse: Text-to- models. ArXiv, abs/2210.02303, 2022. 1, 2
image generation via masked generative transformers. Proc. [22] J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu,
ICML, 2023. 6 Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank
[7] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang adaptation of large language models. ArXiv, abs/2106.09685,
Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiao- 2021. 2
fang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image [23] Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-
generation models using photogenic needles in a haystack. Jun Zha, and Lei Zhang. Dreamtime: An improved optimiza-
arXiv preprint arXiv:2309.15807, 2023. 4 tion strategy for text-to-3d content creation. arXiv preprint
[8] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Genie: arXiv:2306.12422, 2023. 6
Higher-order denoising diffusion solvers. Advances in Neural [24] Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Ta-
Information Processing Systems, 35:30150–30166, 2022. 2 chet des Combes, and Ioannis Mitliagkas. Adversarial score
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, matching and improved sampling for image generation. arXiv
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, preprint arXiv:2009.05475, 2020. 3
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- [25] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli
vain Gelly, et al. An image is worth 16x16 words: Trans- Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans
formers for image recognition at scale. arXiv preprint for text-to-image synthesis. In Proceedings of the IEEE/CVF
arXiv:2010.11929, 2020. 4 Conference on Computer Vision and Pattern Recognition,
[10] Arpad E. Elo. The Rating of Chessplayers, Past and Present. pages 10124–10134, 2023. 1, 2, 6, 14
Arco Pub., New York, 1978. 13 [26] Tero Karras, Samuli Laine, and Timo Aila. A style-based
[11] Patrick Esser, Robin Rombach, and Björn Ommer. Tam- generator architecture for generative adversarial networks.
ing transformers for high-resolution image synthesis. 2021 2019 IEEE/CVF Conference on Computer Vision and Pattern
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4396–4405, 2018. 1, 4, 14
Recognition (CVPR), pages 12868–12878, 2020. 2 [27] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten,
[12] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jaakko Lehtinen, and Timo Aila. Analyzing and improving
Jonathan Granskog, and Anastasis Germanidis. Structure the image quality of stylegan. 2020 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages proach for transferring knowledge from pre-trained diffusion
8107–8116, 2019. 1, 4 models. arXiv preprint arXiv:2305.18455, 2023. 3
[28] Oren Katzir, Or Patashnik, Daniel Cohen-Or, and Dani [43] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik
Lischinski. Noise-free score distillation. arXiv preprint Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans.
arXiv:2310.17590, 2023. 5 On distillation of guided diffusion models. In Proceedings of
[29] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu- the IEEE/CVF Conference on Computer Vision and Pattern
rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Recognition, pages 14297–14306, 2023. 2, 7
Mitsufuji, and Stefano Ermon. Consistency trajectory models: [44] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin.
Learning probability flow ode trajectory of diffusion. arXiv Which training methods for gans do actually converge? In
preprint arXiv:2310.02279, 2023. 3 International conference on machine learning, pages 3481–
[30] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- 3490. PMLR, 2018. 5
tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset [45] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and
of user preferences for text-to-image generation, 2023. 12 Daniel Cohen-Or. Latent-nerf for shape-guided generation
[31] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun of 3d shapes and textures. 2023 IEEE/CVF Conference on
Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snapfusion: Computer Vision and Pattern Recognition (CVPR), pages
Text-to-image diffusion model on mobile devices within two 12663–12673, 2022. 2
seconds. arXiv preprint arXiv:2306.00980, 2023. 7 [46] Takeru Miyato and Masanori Koyama. cgans with projection
[32] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv discriminator. arXiv preprint arXiv:1802.05637, 2018. 4
preprint arXiv:1705.02894, 2017. 5 [47] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo,
[33] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Com- Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel
mon diffusion noise schedules and sample steps are flawed, Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2:
2023. 4 Learning robust visual features without supervision. arXiv
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bour- preprint arXiv:2304.07193, 2023. 6
dev, Ross Girshick, James Hays, Pietro Perona, Deva Ra- [48] Suraj Patil, William Berman, and Patrick von Platen. Amused:
manan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft An open muse model. https://github.com/huggingface/
coco: Common objects in context, 2015. 6 diffusers, 2023. 6, 15
[35] Xingchao Liu, Chengyue Gong, et al. Flow straight and [49] William Peebles and Saining Xie. Scalable diffusion models
fast: Learning to generate and transfer data with rectified with transformers. In Proceedings of the IEEE/CVF Inter-
flow. In The Eleventh International Conference on Learning national Conference on Computer Vision, pages 4195–4205,
Representations, 2022. 2 2023. 6
[36] Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and [50] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann,
Qiang Liu. Instaflow: One step is enough for high-quality Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.
diffusion-based text-to-image generation. arXiv preprint Sdxl: Improving latent diffusion models for high-resolution
arXiv:2309.06380, 2023. 2, 3, 7, 15 image synthesis. arXiv preprint arXiv:2307.01952, 2023. 2,
[37] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan 4, 5, 6, 7, 8, 12, 13
Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion [51] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall.
probabilistic model sampling in around 10 steps. Advances in Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint
Neural Information Processing Systems, 35:5775–5787, 2022. arXiv:2209.14988, 2022. 2, 5, 12
2, 7 [52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
[38] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Hang Zhao. Latent consistency models: Synthesizing Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
high-resolution images with few-step inference. ArXiv, transferable visual models from natural language supervi-
abs/2310.04378, 2023. 2, 13 sion. In International conference on machine learning, pages
[39] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang 8748–8763. PMLR, 2021. 6, 12
Zhao. Latent consistency models: Synthesizing high- [53] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu,
resolution images with few-step inference. arXiv preprint and Mark Chen. Hierarchical text-conditional image gener-
arXiv:2310.04378, 2023. 6 ation with clip latents. ArXiv, abs/2204.06125, 2022. 1, 2,
[40] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von 4
Platen, Apolin’ario Passos, Longbo Huang, Jian Li, and Hang [54] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick
Zhao. Lcm-lora: A universal stable-diffusion acceleration Esser, and Björn Ommer. High-resolution image synthesis
module. ArXiv, abs/2311.05556, 2023. 2, 3, 13, 15 with latent diffusion models. 2022 IEEE/CVF Conference
[41] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von on Computer Vision and Pattern Recognition (CVPR), pages
Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang 10674–10685, 2021. 1, 2, 5, 6, 8
Zhao. Lcm-lora: A universal stable-diffusion acceleration [55] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li,
module. arXiv preprint arXiv:2311.05556, 2023. 6 Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael
[42] Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Pho-
Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal ap- torealistic text-to-image diffusion models with deep language
understanding. Advances in Neural Information Processing [70] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling
Systems, 35:36479–36494, 2022. 4, 6 the generative learning trilemma with denoising diffusion
[56] Tim Salimans and Jonathan Ho. Progressive distillation for gans. arXiv preprint arXiv:2112.07804, 2021. 3
fast sampling of diffusion models. CoRR, abs/2202.00512, [71] Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufo-
2022. 2 gen: You forward once large scale text-to-image generation
[57] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. via diffusion gans. arXiv preprint arXiv:2311.09257, 2023. 7
Projected gans converge faster. Advances in Neural Informa- [72] Chun-Han Yao, Amit Raj, Wei-Chih Hung, Yuanzhen Li,
tion Processing Systems, 34:17480–17492, 2021. 5 Michael Rubinstein, Ming-Hsuan Yang, and Varun Jampani.
[58] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Artic3d: Learning robust articulated 3d shapes from noisy
Scaling stylegan to large diverse datasets. ACM SIGGRAPH web image collections. arXiv preprint arXiv:2306.04619,
2022 Conference Proceedings, 2022. 1, 4 2023. 4
[59] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and [73] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan
Timo Aila. Stylegan-t: Unlocking the power of gans for fast Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei
large-scale text-to-image synthesis. Proc. ICML, 2023. 2, 3, Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana
4, 5, 6, 14 Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui
[60] Juergen Schmidhuber. Generative adversarial networks are Wu. Scaling autoregressive models for content-rich text-to-
special cases of artificial curiosity (1990) and also closely image generation, 2022. 12
related to predictability minimization (1991), 2020. 4 [74] Qinsheng Zhang and Yongxin Chen. Fast sampling of dif-
[61] Christoph Schuhmann, Romain Beaumont, Richard Vencu, fusion models with exponential integrator. arXiv preprint
Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, arXiv:2204.13902, 2022. 2
Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al.
LAION-5B: An open large-scale dataset for training next
generation image-text models. In NeurIPS, 2022. 6
[62] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual,
Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea
Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dy-
namic scene generation. arXiv preprint arXiv:2301.11280,
2023. 3
[63] Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Ma-
heswaranathan, and Surya Ganguli. Deep unsupervised
learning using nonequilibrium thermodynamics. ArXiv,
abs/1503.03585, 2015. 1
[64] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising
diffusion implicit models. In International Conference on
Learning Representations, 2021. 2
[65] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P.
Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differen-
tial equations. ArXiv, abs/2011.13456, 2020. 1
[66] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
Consistency models. In International Conference on Machine
Learning, 2023. 2
[67] George Stein, Jesse C Cresswell, Rasa Hosseinzadeh, Yi Sui,
Brendan Leigh Ross, Valentin Villecroze, Zhaoyan Liu, An-
thony L Caterini, J Eric T Taylor, and Gabriel Loaiza-Ganem.
Exposing flaws of generative model evaluation metrics and
their unfair treatment of diffusion models. arXiv preprint
arXiv:2306.04675, 2023. 6, 14
[68] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh,
and Greg Shakhnarovich. Score jacobian chaining: Lifting
pretrained 2d diffusion models for 3d generation. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 12619–12629, 2023. 2, 5
[69] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan
Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and
diverse text-to-3d generation with variational score distilla-
tion. ArXiv, abs/2305.16213, 2023. 2
Appendix

A. SDS As a Special Case of the Distillation Loss

If we set the weighting function to c(t) = (α_t / 2σ_t) · w(t), where w(t) is the scaling factor from the weighted diffusion loss as in [51], and choose d(x, y) = ||x − y||₂², the distillation loss in Eq. (4) is equivalent to the score distillation objective:

    d/dθ L^MSE_distill
      = E_{t,ϵ′}[ c(t) · d/dθ ||x̂_θ − x̂_ψ(sg(x̂_θ,t); t)||₂² ]
      = E_{t,ϵ′}[ 2 c(t) [x̂_θ − x̂_ψ(x̂_θ,t; t)] · dx̂_θ/dθ ]
      = E_{t,ϵ′}[ (α_t/σ_t) w(t) [ (1/α_t)(x̂_θ,t − x̂_θ,t) + x̂_θ − x̂_ψ(x̂_θ,t; t) ] · dx̂_θ/dθ ]    (5)
      = E_{t,ϵ′}[ (1/σ_t) w(t) [ (α_t x̂_θ − x̂_θ,t) − (α_t x̂_ψ(x̂_θ,t; t) − x̂_θ,t) ] · dx̂_θ/dθ ]
      = E_{t,ϵ′}[ (w(t)/σ_t) [ −σ_t ϵ′ + σ_t ϵ̂_ψ(x̂_θ,t; t) ] · dx̂_θ/dθ ]
      = d/dθ L_SDS
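The identity above can be checked numerically with autograd on a toy example: with c(t) = (α_t / 2σ_t) · w(t), the gradient of the weighted squared-error distillation loss matches the SDS gradient w(t)(ϵ̂_ψ − ϵ′) dx̂_θ/dθ. The linear student and teacher maps and the scalar schedule values below are arbitrary stand-ins chosen only to exercise the algebra.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

d = 8
theta = torch.randn(d, requires_grad=True)
W = torch.randn(d, d)                      # student output x_hat = W @ theta
A = torch.randn(d, d)                      # frozen "teacher" epsilon prediction: eps_hat = A @ x
alpha, sigma, w = 0.8, 0.6, 1.7            # arbitrary schedule values and SDS weight
eps = torch.randn(d)

x_hat = W @ theta                          # student sample (differentiable in theta)
x_t = alpha * x_hat + sigma * eps          # diffused student output
eps_hat = A @ x_t.detach()                 # teacher prediction on the stop-gradient input
x_psi = (x_t.detach() - sigma * eps_hat) / alpha   # teacher's denoised estimate

# Distillation loss of Eq. (4) with c(t) = alpha / (2 * sigma) * w(t).
c = alpha / (2 * sigma) * w
grad_distill, = torch.autograd.grad(c * (x_hat - x_psi).pow(2).sum(), theta, retain_graph=True)

# SDS gradient w * (eps_hat - eps)^T (d x_hat / d theta), via a linear surrogate loss.
grad_sds, = torch.autograd.grad((w * (eps_hat - eps) * x_hat).sum(), theta)

print(torch.allclose(grad_distill, grad_sds))   # True
```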
Figure 9. User preference study (single step). We compare the performance of ADD-M (1-step) against established baselines.
Figure 10. User preference study (multiple steps). We compare the performance of ADD-XL (4-step) against established baselines.
The expected scores E_1 and E_2 of two players with ELO ratings R_1 and R_2 are given by

    E_1 = 1 / (1 + 10^((R_2 − R_1)/400)),    (6)
    E_2 = 1 / (1 + 10^((R_1 − R_2)/400)).    (7)

After observing the result of the game, the ratings R_i are updated via the rule

    R′_i = R_i + K · (S_i − E_i),   i ∈ {1, 2},    (8)

where S_i indicates the outcome of the match for player i; in our case, S_i = 1 if player i wins and S_i = 0 if player i loses. The constant K can be seen as a weight putting emphasis on more recent games. We choose K = 1 and bootstrap the final ELO ranking for a given series of comparisons based on 1000 individual ELO ranking calculations with randomly shuffled order. Before comparing the models, we choose the start rating for every model as R_init = 1000.
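A small sketch of the bootstrapped ELO computation described above. Representing the pairwise comparisons as (winner, loser) tuples and averaging the per-run ratings over bootstrap repetitions are assumptions of this sketch; the update rule, K = 1, the start rating of 1000, and the 1000 randomly shuffled orderings follow the text.

```python
import random
from collections import defaultdict

def elo_ranking(matches, k=1.0, r_init=1000.0, n_bootstrap=1000, seed=0):
    """matches: list of (winner, loser) model-name pairs from the user study."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_bootstrap):
        order = matches[:]
        rng.shuffle(order)                    # randomly shuffled game order
        ratings = defaultdict(lambda: r_init)
        for winner, loser in order:
            r_w, r_l = ratings[winner], ratings[loser]
            e_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400.0))   # Eq. (6)
            e_l = 1.0 - e_w                                   # equivalent to Eq. (7)
            ratings[winner] = r_w + k * (1.0 - e_w)           # Eq. (8) with S = 1
            ratings[loser] = r_l + k * (0.0 - e_l)            # Eq. (8) with S = 0
        for name, r in ratings.items():
            totals[name] += r / n_bootstrap
    return dict(totals)

# Example: elo_ranking([("ADD-XL", "LCM-XL"), ("ADD-XL", "InstaFlow"), ("SDXL", "ADD-XL")])
```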
6 https://huggingface.co/openMUSE
7 https://github.com/deep-floyd/IF
8 https://huggingface.co/latent-consistency/lcm-lora-sdxl
9 https://app.prolific.com
C. GAN Baselines Comparison
For training our state-of-the-art GAN baseline StyleGAN-T++, we follow the training procedure outlined in [59]. The main
differences are extended training (∼2M iterations with a batch size of 2048, which is comparable to GigaGAN’s schedule [25]),
the improved discriminator architecture proposed in Section 3.2, and the R1 penalty applied at each discriminator head.
Fig. 11 shows that StyleGAN-T++ outperforms the previous best GANs by achieving a comparable zero-shot FID to
GigaGAN at a significantly higher CLIP score. Here, we do not compare to DMs, as comparisons between model classes via
automatic metrics tend to be less informative [67]. As an example, GigaGAN achieves FID and CLIP scores comparable to
SD1.5, but its sample quality is still inferior, as noted by the authors.
Figure 11. Comparing text alignment tradeoffs at 256 × 256 pixels. We compare FID–CLIP score curves (zero-shot FID5k vs. CLIP score, ViT-g-14) of StyleGAN-T, StyleGAN-T++, and GigaGAN. For increasing CLIP score, all methods use decreasing truncation [26] for values ψ = {1.0, 0.9, . . . , 0.3}.
Figure 12. Additional single-step 512² images generated with ADD-XL. All samples are generated with a single U-Net evaluation trained with adversarial diffusion distillation (ADD).
D. Additional Samples

We show additional one-step samples, as in Figure 1, in Figure 12. An additional qualitative comparison in the style of Figure 4, demonstrating that our model can further refine quality by using more than one sampling step, is provided in Figure 14: while sampling quality with a single step is already high, more steps can yield higher diversity and better spelling capabilities. Lastly, we provide an additional qualitative comparison of ADD-XL to other state-of-the-art one- and few-step models in Figure 13.
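For completeness, the following sketch shows one plausible multi-step sampling loop that is consistent with the description in Sec. 3.1: sampling starts from pure noise at s = 1000, and the student's prediction is re-diffused to the next (lower) student timestep before the next evaluation. The concrete timestep schedule and the re-noising rule are assumptions rather than the authors' exact inference procedure; fixing the seed keeps the overall layout stable across step counts, as observed in the columns of Figure 14.

```python
import torch

@torch.no_grad()
def sample_add(student, text_emb, alphas, sigmas, steps=(1000, 750, 500, 250),
               shape=(1, 3, 512, 512), seed=0):
    """Iterative ADD sampling sketch: denoise, then re-diffuse to the next timestep."""
    g = torch.Generator().manual_seed(seed)      # fixed seed -> stable layout across step counts
    x = torch.randn(shape, generator=g)          # pure noise; valid input since alpha_1000 = 0
    for i, s in enumerate(steps):
        s_batch = torch.full((shape[0],), s, dtype=torch.long)
        x_hat = student(x, s_batch, text_emb)    # single U-Net evaluation -> denoised estimate
        if i + 1 < len(steps):                   # re-noise to the next (lower) student timestep
            s_next = steps[i + 1]
            eps = torch.randn(shape, generator=g)
            x = alphas[s_next] * x_hat + sigmas[s_next] * eps
    return x_hat
```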
Figure 13. Additional qualitative comparisons to state-of-the-art fast samplers. Few-step samples from our ADD-XL and LCM-XL [40], InstaFlow [36], and OpenMUSE [48].
Prompts: "a robot is playing the guitar at a rock concert in front of a large crowd." / "A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!"

Figure 14. Additional results on the qualitative effect of sampling steps. Similar to Figure 4, we show qualitative examples when sampling ADD-XL with 1, 2, and 4 steps. Single-step samples are often already of high quality, but increasing the number of steps can further improve the diversity (left) and spelling capabilities (right). The seeds are constant within columns and we see that the general layout is preserved across sampling steps, allowing for fast exploration of outputs while retaining the possibility to refine.