DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Muyang Li1 * Tianle Cai2 * Jiaxin Cao3 Qinsheng Zhang4 Han Cai1
Junjie Bai3 Yangqing Jia3 Ming-Yu Liu4 Kai Li2 Song Han1,4
1MIT   2Princeton   3Lepton AI   4NVIDIA
https://github.com/mit-han-lab/distrifuser
Prompt: Ethereal fantasy concept art of an elf, magnificent, celestial, ethereal, painterly, epic, majestic, magical, fantasy art, cover art, dreamy.
Prompt: Romantic painting of a ship sailing in a stormy sea, with dramatic lighting and powerful waves.
Figure 1. We introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without
sacrificing image quality. Naïve Patch (Figure 2(b)) suffers from the fragmentation issue due to the lack of patch interaction. Our
DistriFusion removes artifacts and avoids the communication overhead by reusing the features from the previous steps. Setting: SDXL
with 50-step Euler sampler, 1280 × 1920 resolution. Latency is measured on A100s.
1. Introduction
The advent of AI-generated content (AIGC) represents a seismic shift in technological innovation. Tools like Adobe Firefly, Midjourney, and the recent Sora showcase astonishing capabilities, producing compelling imagery and designs from simple text prompts. These achievements are notably supported by the progression of diffusion models [13, 60]. The emergence of large text-to-image models, including Stable Diffusion [54], Imagen [56], eDiff-I [2], DALL·E [3, 48, 49] and Emu [6], further expands the horizons of AI creativity. Trained on diverse open-web data, these models can generate photorealistic images from text descriptions alone. Such a technological revolution unlocks numerous synthesis and editing applications for images and videos, placing new demands on responsiveness: by interactively guiding and refining the model output, users can achieve more personalized and precise results. Nonetheless, a critical challenge remains: high resolution leads to large computation. For example, the original Stable Diffusion [54] is limited to generating 512 × 512 images. Later, SDXL [46] expands the capability to 1024 × 1024 images. More recently, Sora further pushes the boundaries by enabling video generation at 1080 × 1920 resolution. Despite these advancements, the increased latency of generating high-resolution images presents a tremendous barrier to real-time applications.

Figure 2. (a) Original diffusion model running on a single device. (b) Naïvely splitting the image into 2 patches across 2 GPUs has an evident seam at the boundary due to the absence of interaction across patches. (c) DistriFusion employs synchronous communication for patch interaction at the first step. After that, we reuse the activations from the previous step via asynchronous communication. In this way, the communication overhead can be hidden inside the computation pipeline.

Recent efforts to accelerate diffusion model inference have mainly focused on two approaches: reducing the number of sampling steps [20, 32, 33, 36, 57, 61, 70, 73] and optimizing neural network inference [23, 25, 26]. As computational resources grow rapidly, leveraging multiple GPUs to speed up inference is appealing. For example, in natural language processing (NLP), large language models have successfully harnessed tensor parallelism across GPUs, significantly reducing latency. However, for diffusion models, multiple GPUs are usually only used for batch inference. When generating a single image, typically only one GPU is involved (Figure 2(a)). Techniques like tensor parallelism are less suitable for diffusion models due to their large activation size, as communication costs outweigh the savings from distributed computation. Thus, even when multiple GPUs are available, they cannot be effectively exploited to further accelerate single-image generation. This motivates the development of a method that can utilize multiple GPUs to speed up single-image generation with diffusion models.

A naïve approach would be to divide the image into several patches, assigning each patch to a different device for generation, as illustrated in Figure 2(b). This method allows for independent and parallel operations across devices. However, it suffers from a clearly visible seam at the boundaries of each patch due to the absence of interaction between the individual patches. Introducing interactions among patches to address this issue would, however, incur excessive synchronization costs again, offsetting the benefits of parallel processing.

In this work, we present DistriFusion, a method that enables running diffusion models across multiple devices in parallel to reduce the latency of single-sample generation without hurting image quality. As depicted in Figure 2(c), our approach is also based on patch parallelism, which divides the image into multiple patches, each assigned to a different device. Our key observation is that the inputs across adjacent denoising steps in diffusion models are similar. Therefore, we adopt synchronous communication solely for the first step. For the subsequent steps, we reuse the pre-computed activations from the previous step to provide global context and patch interactions for the current step. We further co-design an inference framework
to implement our algorithm. Specifically, our framework effectively hides the communication overhead within the computation via asynchronous communication. It also sparsely runs the convolutional and attention layers exclusively on the assigned regions, thereby proportionally reducing the per-device computation. Our method, distinct from data, tensor, or pipeline parallelism, introduces a new parallelization opportunity: displaced patch parallelism.

DistriFusion only requires off-the-shelf pre-trained diffusion models and is applicable to a majority of few-step samplers. We benchmark it on a subset of COCO Captions [5]. Without loss of visual fidelity, it mirrors the performance of the original Stable Diffusion XL (SDXL) [46] while reducing the computation* proportionally to the number of used devices. Furthermore, our framework also reduces the latency of the SDXL U-Net for generating a single image by up to 1.8×, 3.4× and 6.1× with 2, 4, and 8 A100 GPUs, respectively. When combined with batch splitting for classifier-free guidance [12], we achieve in total 3.6× and 6.6× speedups using 4 and 8 A100 GPUs for 3840 × 3840 images, respectively. See Figure 1 for some examples of our method.

*Following previous works, we measure the computational cost with the number of Multiply-Accumulate operations (MACs). 1 MAC = 2 FLOPs.

2. Related Work

Diffusion models. Diffusion models have significantly transformed the landscape of content generation [2, 13, 41, 46]. At their core, these models synthesize content through an iterative denoising process. Although this iterative approach yields unprecedented capabilities for content generation, it requires substantially more computational resources and results in slower generative speed. This issue intensifies with the synthesis of high-dimensional data, such as high-resolution [9, 14] or 360° images [75]. Researchers have investigated various perspectives to accelerate the diffusion model. The first line lies in designing more efficient denoising processes. Rombach et al. [54] and Vahdat et al. [66] propose to compress high-resolution images into low-resolution latent representations and learn the diffusion model in the latent space. Another line lies in improving sampling via designing efficient training-free sampling algorithms. A large category of works along this line is built upon the connection between diffusion models and differential equations [62], and leverages well-established exponential integrators [32, 73, 74] to reduce the number of sampling steps while maintaining numerical accuracy. The third strategy involves distilling faster generative models from pre-trained diffusion models. Despite significant progress made in this area, a quality gap persists between these expedited generators and diffusion models [19, 36, 57]. In addition to the above schemes, some works investigate how to optimize neural inference for diffusion models [23, 25, 26]. In this work, we explore a new paradigm for accelerating diffusion by applying parallelism to the neural network across multiple devices.

Parallelism. Existing work has explored various parallelism strategies to accelerate the training and inference of large language models (LLMs), including data, pipeline [15, 27, 38], tensor [17, 39, 71, 72, 78], and zero-redundancy parallelism [47, 50, 51, 77]. Tensor parallelism, in particular, has been widely adopted for accelerating LLMs [28], which are characterized by their substantial model sizes, whereas their activation sizes are relatively small. In such scenarios, the communication overhead introduced by tensor parallelism is relatively minor compared to the substantial latency benefits brought by the increased memory bandwidth. However, the situation differs for diffusion models, which are generally smaller than LLMs but are often bottlenecked by the large activation size due to the spatial dimensions, especially when generating high-resolution content. The communication overhead from tensor parallelism becomes a significant factor, overshadowing the actual computation time. As a result, only data parallelism has been used thus far for diffusion model serving, which provides no latency improvements. The only exception is ParaDiGMS [59], which uses Picard iteration to run multiple steps in parallel. However, this sampler tends to waste much computation, and the generated results exhibit significant deviation from those of the original diffusion model. Our method is based on patch parallelism, which distributes the computation across multiple devices by splitting the input into small patches. Compared to tensor parallelism, such a scheme has superior independence and reduced communication demands. Additionally, it favors the use of AllGather over AllReduce for data interaction, significantly lowering overhead (see Section 5.3 for the full comparisons). Drawing inspiration from the success of asynchronous communication in parallel computing [67], we further reuse the features from the previous step as context for the current step to overlap communication and computation, which we call displaced patch parallelism. This represents the first parallelism strategy tailored to the sequential characteristics of diffusion models while avoiding the heavy communication costs of traditional techniques like tensor parallelism.

Sparse computation. Sparse computation has been extensively researched in various domains, including weight [10, 16, 21, 31], input [53, 64, 65] and activation sparsity [7, 18, 23, 24, 42, 52, 58]. In the activation domain, to facilitate on-hardware speedups, several studies propose to use structured sparsity. SBNet [52] employs a spatial mask to sparsify activations for accelerating 3D object detection. This mask can be derived either from prior problem knowledge or from an auxiliary network. In the context of image generation, SIGE [23] leverages the highly structured sparsity of user edits, selectively performing computation at the edited regions to speed up GANs [8] and diffusion models. MCUNetV2 [29] adopts patch-based inference to reduce memory usage for image classification and detection. In our work, we also partition the input into patches, each processed by a different device. However, we focus on reducing latency via parallelism for image generation instead. Each device solely processes its assigned region to reduce the per-device computation.
3. Background

To generate a high-quality image, a diffusion model often trains a noise-prediction neural network (e.g., a U-Net [55]) ϵ_θ. Starting from pure Gaussian noise x_T ∼ N(0, I), it takes tens to hundreds of iterative denoising steps to obtain the final clean image x_0, where T is the total number of steps. Specifically, given the noisy image x_t at time step t, the model ϵ_θ takes x_t, t and an additional condition c (e.g., text) as inputs to predict the corresponding noise ϵ_t within x_t. At each denoising step, x_{t−1} can be derived from the following equation:

x_{t−1} = Update(x_t, t, ϵ_t),    ϵ_t = ϵ_θ(x_t, t, c).    (1)

Here, 'Update' refers to a sampler-specific function that typically includes only element-wise additions and multiplications. Therefore, the primary source of latency in this process is the forward passes through the model ϵ_θ. For example, Stable Diffusion XL [46] requires 6,763 GMACs per step to generate a 1024 × 1024 image. This computational demand escalates more than quadratically with increasing resolution, making the latency of generating a single high-resolution image impractically high for real-world applications. Furthermore, given that x_{t−1} depends on x_t, parallel computation of ϵ_t and ϵ_{t−1} is challenging. Hence, even with multiple idle GPUs, accelerating the generation of a single high-resolution image remains tricky. Recently, Shih et al. introduced ParaDiGMS [59], employing Picard iterations to parallelize the denoising steps in a data-parallel manner. However, ParaDiGMS wastes computation on speculative guesses that fail quality thresholds. It also relies on a large total step count T to exploit multi-GPU data parallelism, limiting its potential applications. Another conventional method is sharding the model across multiple devices and using tensor parallelism for inference. However, this method suffers from intolerable communication costs, making it impractical for real-world applications. Beyond these two schemes, are there alternative strategies for distributing workloads across multiple GPU devices so that single-image generation can also enjoy the free-lunch speedups from multiple devices?
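To make the cost structure of Eq. (1) concrete, the following is a minimal sketch of the sampling loop with a toy stand-in for ϵ_θ; the model, update rule, and tensor shapes are illustrative assumptions rather than the actual SDXL configuration.

```python
# Minimal sketch (not the authors' code) of the sequential sampling loop in Eq. (1).
# `eps_model` stands in for the noise-prediction network eps_theta and `update` for
# the sampler-specific Update function; both are illustrative placeholders.
import torch
import torch.nn as nn

T = 50                                         # number of denoising steps
eps_model = nn.Conv2d(4, 4, 3, padding=1)      # toy stand-in for the SDXL U-Net

def update(x_t, t, eps_t):
    # Sampler-specific rule (e.g., a DDIM/Euler-style step); only cheap element-wise ops.
    return x_t - eps_t / T                     # illustrative placeholder

x = torch.randn(1, 4, 128, 160)                # x_T ~ N(0, I) in latent space
for t in reversed(range(T)):
    with torch.no_grad():
        eps = eps_model(x)                     # dominant cost: one forward pass per step
    x = update(x, t, eps)                      # negligible cost
# x approximates x_0. The T forward passes are inherently sequential because each
# x_{t-1} depends on x_t, so extra idle GPUs do not help a single image by default.
```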
4. Method

The key idea of DistriFusion is to parallelize the computation across devices by splitting the image into patches. Naïvely, this can be done by either (1) independently computing the patches and stitching them together, or (2) synchronously communicating intermediate activations between patches. However, the first approach leads to visible discrepancies at the boundaries of each patch due to the absence of interaction between them (see Figure 1 and Figure 2(b)). The second approach, on the other hand, incurs excessive communication overheads, negating the benefits of parallel processing. To address these challenges, we propose a novel parallelism paradigm, displaced patch parallelism, which leverages the sequential nature of diffusion models to overlap communication and computation. Our key insight is reusing slightly outdated, or 'stale', activations from the previous diffusion step to facilitate interactions between patches, which we describe as activation displacement. This is based on the observation that the inputs for consecutive denoising steps are relatively similar. Consequently, computing each patch's activation at a layer does not rely on other patches' fresh activations, allowing the communication to be hidden within subsequent layers' computation. We will next provide a detailed breakdown of each aspect of our algorithm and system design.

Displaced patch parallelism. As shown in Figure 3, when predicting ϵ_θ(x_t) (we omit the inputs of timestep t and condition c here for simplicity), we first split x_t into multiple patches x_t^(1), x_t^(2), …, x_t^(N), where N is the number of devices. For example, we use N = 2 in Figure 3. Each device holds a replica of the model ϵ_θ and processes a single patch independently, in parallel.

For a given layer l, consider the input activation patch on the i-th device, denoted as A_t^{l,(i)}. This patch is first scattered into the stale activation from the previous step, A_{t+1}^l, at its corresponding spatial location (how A_{t+1}^l is obtained will be discussed later). Here, A_{t+1}^l has the full spatial shape. In the Scatter output, only the 1/N region where A_t^{l,(i)} is placed is fresh and requires recomputation. We then selectively apply the layer operation F_l (linear, convolution, or attention) to these fresh areas, thereby generating the output for the corresponding regions. This process is repeated for each layer. Finally, the outputs from all layers are synchronized together to approximate ϵ_θ(x_t). Through this methodology, each device is responsible for only 1/N of the total computation, enabling efficient parallelization.
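The single-process sketch below illustrates the Scatter-then-sparse-operator idea for one self-attention layer. It is a toy illustration rather than the released implementation: the tensor shapes, the height-wise split, and the omission of the QKV projections are simplifying assumptions.

```python
# Toy illustration of Scatter + sparse operator F_l for one attention layer, assuming
# the image tokens are split along the height dimension across N devices. `stale_full`
# plays the role of A_{t+1}^l; `fresh_patch` plays the role of A_t^{l,(i)}.
import torch
import torch.nn.functional as F

N, i = 2, 0                                    # number of devices, index of this device
B, H, W, C = 1, 64, 80, 320                    # token grid (H x W) with C channels
stale_full = torch.randn(B, H * W, C)          # cached full-shape activation from step t+1
fresh_patch = torch.randn(B, (H // N) * W, C)  # freshly computed patch at step t

# Scatter: overwrite this device's region of the stale buffer with the fresh patch.
start, stop = i * (H // N) * W, (i + 1) * (H // N) * W
scattered = stale_full.clone()
scattered[:, start:stop] = fresh_patch

# Sparse operator: queries come only from the fresh 1/N of the tokens, while keys and
# values use the full scattered activation, so the patch still attends to (slightly
# stale) global context. QKV and output projections are omitted for brevity.
out_patch = F.scaled_dot_product_attention(fresh_patch, scattered, scattered)
print(out_patch.shape)                         # (B, (H // N) * W, C)
```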
There still remains the problem of how to obtain the stale activations from the previous step. As shown in Figure 3, at each timestep t, once device i has computed A_t^{l,(i)}, it broadcasts this activation to all other devices by performing an AllGather operation. Modern GPUs often support asynchronous communication and computation, which means that this AllGather does not block ongoing computations. By the time we reach layer l in the next timestep, each device should have already received a replica of A_t^l. Such an approach effectively hides the communication overhead within the computation phase, as shown in Figure 4. However, there is an exception: the very first step (i.e., x_T). In this scenario, each device simply executes the standard synchronous communication.
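Below is a hedged sketch of how the per-layer AllGather can be overlapped with the computation of later layers using torch.distributed's asynchronous collectives. It assumes an initialized NCCL process group with one process per GPU (e.g., launched with torchrun) and is not the authors' framework; the function and argument names are illustrative.

```python
# Sketch of displaced communication: launch the AllGather of this step's activation
# patch asynchronously and keep computing later layers while NCCL moves data.
# Assumes torch.distributed is already initialized with one process per GPU.
import torch
import torch.distributed as dist

def forward_with_displaced_allgather(fresh_patch, next_layers, world_size):
    """fresh_patch: this device's A_t^{l,(i)}; next_layers: callables for later layers."""
    gathered = [torch.empty_like(fresh_patch) for _ in range(world_size)]
    handle = dist.all_gather(gathered, fresh_patch.contiguous(), async_op=True)

    out = fresh_patch
    for layer in next_layers:                  # communication overlaps with this work
        out = layer(out)

    # In DistriFusion the gathered result is only needed at layer l of the *next*
    # denoising step, so by then wait() is essentially free.
    handle.wait()
    stale_full = torch.cat(gathered, dim=-2)   # patches in rank order -> full A_t^l
    return out, stale_full
```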
Figure 3. Overview of DistriFusion. For simplicity, we omit the inputs of t and c, and use N = 2 devices as an example. Superscripts (1) and (2) represent the first and the second patch, respectively. Stale activations from the previous step are darkened. At each step t, we first split the input x_t into N patches x_t^(1), …, x_t^(N). For each layer l and device i, upon getting the input activation patch A_t^{l,(i)}, two operations then proceed asynchronously: First, on device i, A_t^{l,(i)} is scattered back into the stale activation A_{t+1}^l from the previous step. The output of this Scatter operation is then fed into the sparse operator F_l (linear, convolution, or attention layers), which performs computation exclusively on the fresh regions and produces the corresponding output. Meanwhile, an AllGather operation is performed over A_t^{l,(i)} to prepare the full activation A_t^l for the next step. We repeat this procedure for each layer. The final outputs are then aggregated together to approximate ϵ_θ(x_t), which is used to compute x_{t−1}. The timeline visualization of each device for predicting ϵ_θ(x_t) is shown in Figure 4.
Metrics. Following previous works [22, 23, 37, 43], we use Clean-FID [44] to calculate FID.
Figure 5. Qualitative results. FID is computed against the ground-truth images. Our DistriFusion can reduce the latency according to the number of used devices while preserving visual fidelity. Prompt: A brown dog laying on the ground with a metal bowl in front of him. Panels: Original (latency 5.02s, FID 24.0); Naïve Patch, 2 devices (2.83s, 1.8× faster, FID 33.6); ParaDiGMS, 8 devices (1.80s, 2.8× faster, FID 25.1); Ours, 2 devices (3.35s, 1.5× faster, FID 24.0); Ours, 4 devices (2.26s, 2.2× faster, FID 24.2); Ours, 8 devices (1.77s, 2.8× faster, FID 24.3).
A100 GPUs. In the 50-step setting, ParaDiGMS matches our method's 2.8× speedup, but at the cost of compromised image quality (see Figure 5). In the more commonly used 25-step setting, ParaDiGMS only obtains a marginal 1.3× speedup due to excessive wasted guesses, which is also reported in Shih et al. [59]. In contrast, our method still mirrors the original quality and accelerates the model by 2.7×.

When generating 1024 × 1024 images, our speedups are limited by the low GPU utilization of SDXL. To maximize device usage, we further scale the resolution to 2048 × 2048 and 3840 × 3840 in Figure 6. At these larger resolutions, the GPU devices are better utilized. Specifically, for 3840 × 3840 images, DistriFusion reduces the latency by 1.8×, 3.4× and 6.1× with 2, 4 and 8 A100s, respectively. Note that these results are benchmarked with PyTorch. With more advanced compilers, such as TVM [4] and TensorRT [1], we anticipate even higher GPU utilization and consequently more pronounced speedups from DistriFusion, as observed in SIGE [23].
#Steps  #Devices  Method        PSNR (↑)   LPIPS (↓)            FID (↓)              MACs (T)  Latency (s)  Speedup
                                w/ Orig.   w/ G.T.   w/ Orig.   w/ G.T.   w/ Orig.
50      1         Original      –          0.797     –          24.0      –           338       5.02         –
50      2         Naïve Patch   28.2       0.812     0.596      33.6      29.4        322       2.83         1.8×
50      2         Ours          31.9       0.797     0.146      24.2      4.86        338       3.35         1.5×
50      4         Naïve Patch   27.9       0.853     0.753      125       133         318       1.74         2.9×
50      4         Ours          31.0       0.798     0.183      24.2      5.76        338       2.26         2.2×
50      8         Naïve Patch   27.8       0.892     0.857      252       259         324       1.27         4.0×
50      8         ParaDiGMS     29.3       0.800     0.320      25.1      10.8        657       1.80         2.8×
50      8         Ours          30.5       0.799     0.211      24.4      6.46        338       1.77         2.8×
25      1         Original      –          0.801     –          23.9      –           169       2.52         –
25      8         ParaDiGMS     29.6       0.808     0.273      25.8      10.4        721       1.89         1.3×
25      8         Ours          31.5       0.802     0.161      24.6      5.67        169       0.93         2.7×

Table 1. Quantitative evaluation. MACs measures the cumulative computation across all devices for the whole denoising process for generating a single 1024 × 1024 image. w/ G.T. means calculating the metrics against the ground-truth images, while w/ Orig. means against the original model's samples. For PSNR, we report the w/ Orig. setting. Our method mirrors the results of the original model across all metrics while maintaining the total MACs. It also reduces the latency on NVIDIA A100 GPUs in proportion to the number of used devices.
Figure 6. Measured total latency of DistriFusion with the 50-step DDIM sampler [61] for generating a single image across different resolutions on NVIDIA A100 GPUs. When scaling up the resolution, the GPU devices are better utilized. Remarkably, when generating 3840 × 3840 images, DistriFusion achieves 1.8×, 3.4× and 6.1× speedups with 2, 4, and 8 A100s, respectively.

                     1024 × 1024          2048 × 2048          3840 × 3840
Method               Comm.    Latency     Comm.    Latency     Comm.    Latency
Sync. TP             1.33G    3.61s       5.33G    11.7s       18.7G    46.3s
Sync. PP             0.42G    2.21s       1.48G    5.62s       5.38G    24.7s
DistriFusion (Ours)  0.42G    1.77s       1.48G    4.81s       5.38G    22.9s
No Comm.             –        1.48s       –        4.14s       –        21.3s

Table 2. Communication cost comparisons with 8 A100s across different resolutions. Sync. TP/PP: synchronous tensor/patch parallelism. No Comm.: an ideal PP without any communication. Comm. measures the total communication amount. PP only requires less than 1/3 of the communication amount of TP. Our DistriFusion further reduces the communication overhead by 50 ∼ 60%.
In practical use, the batch size often doubles due to classifier-free guidance [12]. We can first split the batch and then apply DistriFusion to each batch separately. This approach further improves the total speedups to 3.6× and 6.6× with 4 and 8 A100s for generating a single 3840 × 3840 image, respectively.
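A hedged sketch of this batch splitting is shown below: with 8 GPUs, the ranks are divided into two groups of four, one per classifier-free-guidance branch, and DistriFusion's per-layer collectives are restricted to each group. The helper and its names are illustrative, and it assumes torch.distributed is already initialized with a world size divisible by the number of branches.

```python
# Illustrative sketch (not the authors' framework) of combining classifier-free-guidance
# batch splitting with patch parallelism: e.g., on 8 GPUs, ranks 0-3 handle the
# conditional branch, ranks 4-7 the unconditional branch, and each group of 4 splits
# the image into 4 patches.
import torch.distributed as dist

def assign_work(rank: int, world_size: int, num_branches: int = 2):
    ranks_per_branch = world_size // num_branches        # e.g., 8 GPUs -> 4 per branch
    branch = rank // ranks_per_branch                     # 0: conditional, 1: unconditional
    patch_index = rank % ranks_per_branch                 # which image patch this rank owns
    # DistriFusion's per-layer AllGather only needs to span the ranks sharing a branch,
    # so we create one communication subgroup per CFG branch (built on every rank).
    groups = [dist.new_group(list(range(b * ranks_per_branch, (b + 1) * ranks_per_branch)))
              for b in range(num_branches)]
    return branch, patch_index, groups[branch]
```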
5.3. Ablation Study

Compare to tensor parallelism. In Table 2, we benchmark our latency against synchronous tensor parallelism (Sync. TP) and synchronous patch parallelism (Sync. PP), and report the corresponding communication amounts. Compared to TP, PP has better independence, which eliminates the need for communication within cross-attention and linear layers. For convolutional layers, communication is only required at the patch boundaries, which represent a minimal portion of the entire tensor. Moreover, PP uses AllGather instead of AllReduce, leading to lower communication demands and no additional use of computing resources. Therefore, PP requires 60% less communication and is 1.6 ∼ 2.1× faster than TP, making it a more efficient approach for deploying diffusion models. We also include a theoretical PP baseline without any communication (No Comm.) to demonstrate the communication overhead in Sync. PP and DistriFusion. Compared to Sync. PP, DistriFusion further cuts this overhead by over 50%. The remaining overhead mainly comes from our current use of the NVIDIA Collective Communication Library (NCCL) for asynchronous communication. NCCL kernels use SMs (the computing resources on GPUs), which slows down the overlapped computation. Using remote memory access could bypass this issue and close the performance gap.

Input similarity. Our displaced patch parallelism relies on the assumption that the inputs of consecutive denoising steps are similar. To support this claim, we quantitatively calculate the model input difference across all consecutive steps using a 50-step DDIM sampler. The average difference is only 0.02, within an input range of [−4, 4] (about 0.3%). Figure 7 further qualitatively visualizes the input difference between steps 9 and 8 (randomly selected). The difference is nearly all zero, substantiating our hypothesis of high similarity between inputs from neighboring steps.
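This measurement can be reproduced with a short script along the following lines. It relies on the diffusers SDXL pipeline and its callback_on_step_end hook (available in recent diffusers releases); treat the exact API, model identifier, and prompt as assumptions, not the paper's evaluation code.

```python
# Record the model input x_t at every step of a 50-step DDIM run and report the mean
# absolute difference between consecutive inputs. Illustrative sketch only.
import torch
from diffusers import StableDiffusionXLPipeline, DDIMScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

inputs = []

def record(pipeline, step, timestep, callback_kwargs):
    # The latents at the end of step t are the input x_{t-1} fed into the next step.
    inputs.append(callback_kwargs["latents"].detach().float().cpu())
    return callback_kwargs

pipe("a photo of a corgi", num_inference_steps=50,
     callback_on_step_end=record,
     callback_on_step_end_tensor_inputs=["latents"])

diffs = [(a - b).abs().mean().item() for a, b in zip(inputs[:-1], inputs[1:])]
print(f"mean |x_t - x_(t-1)| across steps: {sum(diffs) / len(diffs):.4f}")
```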
[Figure 7: visualization of the step-9 input x_9, the step-8 input x_8, and their difference |x_9 − x_8|.]
[Figure 9: comparison of GroupNorm schemes against the original model (latency 5.02s): Separate GN (LPIPS 0.317, 1.64s), Stale GN (LPIPS 0.247, 1.76s), Sync. GN (LPIPS 0.207, 1.85s), and Ours (LPIPS 0.211, 1.77s).]
Few-step sampling and warm-up steps. As stated above, our approach hinges on the observation that adjacent denoising steps share similar inputs, i.e., x_t ≈ x_{t−1}. However, as we increase the step size and thereby reduce the number of steps, the approximation error escalates, potentially compromising the effectiveness of our method. In Figure 8, we present results using the 10-step DPM-Solver [32, 33]. The 10-step configuration is roughly the threshold for training-free samplers to maintain image quality. Under this setting, naïve DistriFusion without warm-up struggles to preserve the image quality. However, incorporating an additional two-step warm-up significantly recovers the performance with only slightly increased latency.
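The warm-up schedule itself is just a change of control flow, as in the sketch below; denoise_step_sync and denoise_step_displaced are hypothetical stand-ins for the two execution modes, and only the scheduling logic is the point.

```python
# Hedged sketch of warm-up: run the first `warmup_steps` denoising steps in fully
# synchronous mode (fresh activations exchanged between patches), then switch to
# displaced (stale-activation) mode. The step functions are illustrative stubs.
import torch

def denoise_step_sync(x, t):       # placeholder for a fully synchronized step
    return x * 0.98

def denoise_step_displaced(x, t):  # placeholder for a step reusing stale activations
    return x * 0.98

def sample(x_T, timesteps, warmup_steps=2):
    x = x_T
    for step, t in enumerate(timesteps):
        step_fn = denoise_step_sync if step < warmup_steps else denoise_step_displaced
        x = step_fn(x, t)
    return x

x0 = sample(torch.randn(1, 4, 128, 128), timesteps=range(10, 0, -1), warmup_steps=2)
```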
GroupNorm. As discussed in Section 4, calculating accurate group normalization (GN) statistics is crucial for preserving image quality. In Figure 9, we compare four different GN schemes. The first approach, Separate GN, uses statistics from the on-device fresh patch. This approach delivers the best speed at the cost of lower image fidelity. The compromise is particularly severe for large numbers of devices, due to the insufficient patch size for precise statistics estimation. The second scheme, Stale GN, computes statistics using the stale activations. However, this method also faces quality degradation.
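As a rough illustration of why the statistics matter, the sketch below contrasts purely local ("Separate GN"-style) statistics with statistics synchronized across patches by aggregating per-group sums. It assumes one process per patch with torch.distributed initialized and is not the paper's corrected GN scheme; all names are illustrative.

```python
# Hedged sketch of GroupNorm statistics under patch parallelism. With synchronize=False
# the mean/variance come only from the local patch; with synchronize=True the per-group
# sufficient statistics are all-reduced so they match the full spatial tensor.
import torch
import torch.distributed as dist

def group_norm_stats(local_patch, num_groups, synchronize=False):
    # local_patch: (B, C, H_local, W) slice of the full activation; C must divide by num_groups.
    B, C, H, W = local_patch.shape
    x = local_patch.reshape(B, num_groups, -1)
    s = x.sum(dim=-1)                          # per-group sum
    sq = (x * x).sum(dim=-1)                   # per-group sum of squares
    count = torch.full_like(s, x.shape[-1])    # per-group element count (local)
    if synchronize:
        # Aggregating sums, squared sums, and counts across devices yields exactly the
        # statistics of the full tensor, at the price of extra communication.
        for t in (s, sq, count):
            dist.all_reduce(t)
    mean = s / count
    var = sq / count - mean * mean
    return mean, var
```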
6. Conclusion & Discussion

In this paper, we introduce DistriFusion to accelerate diffusion models with multiple GPUs for parallelism. Our method divides images into patches, assigning each to a separate GPU. We reuse the pre-computed activations from previous steps to maintain patch interactions. On Stable Diffusion XL, our method achieves up to a 6.1× speedup on 8 NVIDIA A100s. This advancement not only enhances the efficiency of AI-generated content creation but also sets a new benchmark for future research in parallel computing for AI applications.

Limitations. To fully hide the communication overhead within the computation, NVLink is essential for DistriFusion to maximize the speedup. Fortunately, NVLink has become widely available in recent years. Moreover, quantization [25] can also reduce the communication workload of our method. Besides, DistriFusion offers limited speedups for low-resolution images, as the devices are underutilized. Advanced compilers [1, 4] would help to better exploit the devices and achieve higher speedups. Our method may not work for extremely-few-step methods [34–36, 57, 63], due to the rapid changes of the denoising states. Yet our preliminary experiments suggest that slightly more steps (e.g., 10) are enough for DistriFusion to obtain high-quality results.
Acknowledgments

We thank Jun-Yan Zhu and Ligeng Zhu for their helpful discussion and valuable feedback. The project is supported by MIT-IBM Watson AI Lab, Amazon, MIT Science Hub, and National Science Foundation.

References

[1] NVIDIA. TensorRT, 2023.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
[4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
[5] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[6] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
[7] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In CVPR, 2017.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.
[9] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, and Navdeep Jaitly. Matryoshka diffusion models. arXiv preprint arXiv:2310.15111, 2023.
[10] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. NeurIPS, 2015.
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017.
[12] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
[14] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
[15] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. NeurIPS, 2019.
[16] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[17] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. MLSys, 2019.
[18] Patrick Judd, Alberto Delmas, Sayeh Sharify, and Andreas Moshovos. Cnvlutin2: Ineffectual-activation-and-weight-free deep neural network computing. arXiv preprint arXiv:1705.00125, 2017.
[19] Gwanghyun Kim and Jong Chul Ye. DiffusionCLIP: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2021.
[20] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
[21] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2016.
[22] Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, and Song Han. GAN compression: Efficient architectures for interactive conditional GANs. In CVPR, 2020.
[23] Muyang Li, Ji Lin, Chenlin Meng, Stefano Ermon, Song Han, and Jun-Yan Zhu. Efficient spatially sparse inference for conditional GANs and diffusion models. In NeurIPS, 2022.
[24] Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR, 2017.
[25] Xiuyu Li, Long Lian, Yijiang Liu, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-Diffusion: Quantizing diffusion models. arXiv preprint arXiv:2302.04304, 2023.
[26] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. NeurIPS, 2023.
[27] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, D. Song, and I. Stoica. TeraPipe: Token-level pipeline parallelism for training large-scale language models. ICML, 2021.
[28] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Z. Chen, Hao Zhang, Joseph E. Gonzalez, and I. Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In USENIX OSDI, 2023.
[29] Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, and Song Han. MCUNetV2: Memory-efficient patch-based inference for tiny deep learning. In NeurIPS, 2021.
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[31] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In CVPR, 2015.
[32] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[33] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
[34] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
[35] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. LCM-LoRA: A universal Stable Diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023.
[36] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022.
[37] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
[38] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In SOSP, 2019.
[39] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, and M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021.
[40] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[41] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[42] Bowen Pan, Wuwei Lin, Xiaolin Fang, Chaoqin Huang, Bolei Zhou, and Cewu Lu. Recurrent residual module for fast inference in videos. In CVPR, 2018.
[43] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[44] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In CVPR, pages 11400–11410, 2022.
[45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[46] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
[47] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
[48] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[49] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[50] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, pages 3505–3506, 2020.
[51] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training. In USENIX ATC, pages 551–564, 2021.
[52] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun. SBNet: Sparse blocks network for fast inference. In CVPR, 2018.
[53] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017.
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[55] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
[57] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2021.
[58] Shaohuai Shi and Xiaowen Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724, 2017.
[59] Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, and Nima Anari. Parallel sampling of diffusion models. NeurIPS, 2023.
[60] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[61] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2020.
[62] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2020.
[63] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models, 2023.
[64] Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, and Song Han. TorchSparse: Efficient point cloud inference engine. In MLSys, 2022.
[65] Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. TorchSparse++: Efficient training and inference framework for sparse convolution on GPUs. In MICRO, 2023.
[66] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. NeurIPS, 34:11287–11302, 2021.
[67] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990.
[68] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[69] Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, and Song Han. FastComposer: Tuning-free multi-subject image generation with localized attention. arXiv, 2023.
[70] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In ICLR, 2022.
[71] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.
[72] Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, and Jie Zhao. OneFlow: Redesign the distributed deep learning framework from scratch. arXiv preprint arXiv:2110.15032, 2021.
[73] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In ICLR, 2022.
[74] Qinsheng Zhang, Molei Tao, and Yongxin Chen. gDDIM: Generalized denoising diffusion implicit models, 2022.
[75] Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, and Ming-Yu Liu. DiffCollage: Parallel generation of large content with diffusion models. In CVPR, 2023.
[76] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[77] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
[78] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In OSDI, pages 559–578, 2022.