
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Muyang Li1*  Tianle Cai2*  Jiaxin Cao3  Qinsheng Zhang4  Han Cai1  Junjie Bai3  Yangqing Jia3  Ming-Yu Liu4  Kai Li2  Song Han1,4
1MIT  2Princeton  3Lepton AI  4NVIDIA
https://github.com/mit-han-lab/distrifuser
arXiv:2402.19481v3 [cs.CV] 15 Apr 2024

[Figure 1 panels: Original (1 GPU): 907T MACs, 12.3s latency. Naïve Parallelization (4 GPUs): 190T MACs per device (4.8× less), 3.14s latency (3.9× faster), but with duplicated-subject artifacts. DistriFusion (Ours, 4 GPUs): 227T MACs per device (4.0× less), 4.16s latency (3.0× faster), without artifacts. Prompts: "Ethereal fantasy concept art of an elf, magnificent, celestial, ethereal, painterly, epic, majestic, magical, fantasy art, cover art, dreamy." and "Romantic painting of a ship sailing in a stormy sea, with dramatic lighting and powerful waves."]

Figure 1. We introduce DistriFusion, a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality. Naïve Patch (Figure 2(b)) suffers from the fragmentation issue due to the lack of patch interaction. Our DistriFusion removes artifacts and avoids the communication overhead by reusing the features from the previous steps. Setting: SDXL with a 50-step Euler sampler, 1280 × 1920 resolution. Latency is measured on A100s.

Abstract

Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive latency for interactive applications. In this paper, we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However, naïvely implementing such an algorithm breaks the interaction between patches and loses fidelity, while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma, we observe the high similarity between the inputs from adjacent diffusion steps and propose displaced patch parallelism, which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore, our method supports asynchronous communication, which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1× speedup on eight A100 GPUs compared to one.

* indicates equal contributions.

[Figure 2 schematic: (a) Original, (b) Naïve Patch, (c) DistriFusion. Legend: Computation; Asynchronous Communication; On Device 1; On Device 2; ⨁ Asynchronous AllGather.]

Figure 2. (a) Original diffusion model running on a single device. (b) Naïvely splitting the image into 2 patches across 2 GPUs has an evident seam at the boundary due to the absence of interaction across patches. (c) DistriFusion employs synchronous communication for patch interaction at the first step. After that, we reuse the activations from the previous step via asynchronous communication. In this way, the communication overhead can be hidden into the computation pipeline.

1. Introduction
The advent of AI-generated content (AIGC) represents a seismic shift in technological innovation. Tools like Adobe Firefly, Midjourney and the recent Sora showcase astonishing capabilities, producing compelling imagery and designs from simple text prompts. These achievements are notably supported by the progression in diffusion models [13, 60]. The emergence of large text-to-image models, including Stable Diffusion [54], Imagen [56], eDiff-I [2], DALL·E [3, 48, 49] and Emu [6], further expands the horizons of AI creativity. Trained on diverse open-web data, these models can generate photorealistic images from text descriptions alone. Such a technological revolution unlocks numerous synthesis and editing applications for images and videos, placing new demands on responsiveness: by interactively guiding and refining the model output, users can achieve more personalized and precise results. Nonetheless, a critical challenge remains: high resolution leads to large computation. For example, the original Stable Diffusion [54] is limited to generating 512 × 512 images. Later, SDXL [46] expands the capabilities to 1024 × 1024 images. More recently, Sora further pushes the boundaries by enabling video generation at 1080 × 1920 resolution. Despite these advancements, the increased latency of generating high-resolution images presents a tremendous barrier to real-time applications.

Recent efforts to accelerate diffusion model inference have mainly focused on two approaches: reducing the number of sampling steps [20, 32, 33, 36, 57, 61, 70, 73] and optimizing neural network inference [23, 25, 26]. As computational resources grow rapidly, leveraging multiple GPUs to speed up inference is appealing. For example, in natural language processing (NLP), large language models have successfully harnessed tensor parallelism across GPUs, significantly reducing latency. However, for diffusion models, multiple GPUs are usually only used for batch inference. When generating a single image, typically only one GPU is involved (Figure 2(a)). Techniques like tensor parallelism are less suitable for diffusion models due to the large activation size, as the communication costs outweigh the savings from distributed computation. Thus, even when multiple GPUs are available, they cannot be effectively exploited to further accelerate single-image generation. This motivates the development of a method that can utilize multiple GPUs to speed up single-image generation with diffusion models.

A naïve approach would be to divide the image into several patches, assigning each patch to a different device for generation, as illustrated in Figure 2(b). This method allows for independent and parallel operations across devices. However, it suffers from a clearly visible seam at the boundaries of each patch due to the absence of interaction between the individual patches. Introducing interactions among patches to address this issue would incur excessive synchronization costs again, offsetting the benefits of parallel processing.

In this work, we present DistriFusion, a method that enables running diffusion models across multiple devices in parallel to reduce the latency of single-sample generation without hurting image quality. As depicted in Figure 2(c), our approach is also based on patch parallelism, which divides the image into multiple patches, each assigned to a different device. Our key observation is that the inputs across adjacent denoising steps in diffusion models are similar. Therefore, we adopt synchronous communication solely for the first step. For the subsequent steps, we reuse the pre-computed activations from the previous step to provide global context and patch interactions for the current step. We further co-design an inference framework to implement our algorithm. Specifically, our framework effectively hides the communication overhead within the computation via asynchronous communication. It also sparsely runs the convolutional and attention layers exclusively on the assigned regions, thereby proportionally reducing per-device computation. Our method, distinct from data, tensor, or pipeline parallelism, introduces a new parallelization opportunity: displaced patch parallelism.

DistriFusion only requires off-the-shelf pre-trained diffusion models and is applicable to a majority of few-step samplers. We benchmark it on a subset of COCO Captions [5]. Without loss of visual fidelity, it mirrors the performance of the original Stable Diffusion XL (SDXL) [46] while reducing the computation* proportionally to the number of used devices. Furthermore, our framework also reduces the latency of the SDXL U-Net for generating a single image by up to 1.8×, 3.4× and 6.1× with 2, 4, and 8 A100 GPUs, respectively. When combined with batch splitting for classifier-free guidance [12], we achieve in total 3.6× and 6.6× speedups using 4 and 8 A100 GPUs for 3840 × 3840 images, respectively. See Figure 1 for some examples of our method.

*Following previous works, we measure the computational cost with the number of Multiply-Accumulate operations (MACs). 1 MAC = 2 FLOPs.
2. Related Work

Diffusion models. Diffusion models have significantly transformed the landscape of content generation [2, 13, 41, 46]. At their core, these models synthesize content through an iterative denoising process. Although this iterative approach yields unprecedented capabilities for content generation, it requires substantially more computational resources and results in slower generative speed. This issue intensifies with the synthesis of high-dimensional data, such as high-resolution [9, 14] or 360° images [75]. Researchers have investigated various perspectives to accelerate the diffusion model. The first line lies in designing more efficient denoising processes. Rombach et al. [54] and Vahdat et al. [66] propose to compress high-resolution images into low-resolution latent representations and learn the diffusion model in the latent space. Another line lies in improving sampling via designing efficient training-free sampling algorithms. A large category of works along this line is built upon the connection between diffusion models and differential equations [62], and leverages a well-established exponential integrator [32, 73, 74] to reduce sampling steps while maintaining numerical accuracy. The third strategy involves distilling faster generative models from pre-trained diffusion models. Despite significant progress made in this area, a quality gap persists between these expedited generators and diffusion models [19, 36, 57]. In addition to the above schemes, some works investigate how to optimize the neural network inference for diffusion models [23, 25, 26]. In this work, we explore a new paradigm for accelerating diffusion by applying parallelism to the neural network across multiple devices.

Parallelism. Existing work has explored various parallelism strategies to accelerate the training and inference of large language models (LLMs), including data, pipeline [15, 27, 38], tensor [17, 39, 71, 72, 78], and zero-redundancy parallelism [47, 50, 51, 77]. Tensor parallelism, in particular, has been widely adopted for accelerating LLMs [28], which are characterized by their substantial model sizes, whereas their activation sizes are relatively small. In such scenarios, the communication overhead introduced by tensor parallelism is relatively minor compared to the substantial latency benefits brought by increased memory bandwidth. However, the situation differs for diffusion models, which are generally smaller than LLMs but are often bottlenecked by the large activation size due to the spatial dimensions, especially when generating high-resolution content. The communication overhead from tensor parallelism becomes a significant factor, overshadowing the actual computation time. As a result, only data parallelism has been used thus far for diffusion model serving, which provides no latency improvements. The only exception is ParaDiGMS [59], which uses Picard iteration to run multiple steps in parallel. However, this sampler tends to waste much computation, and the generated results exhibit significant deviation from the original diffusion model. Our method is based on patch parallelism, which distributes the computation across multiple devices by splitting the input into small patches. Compared to tensor parallelism, such a scheme has superior independence and reduced communication demands. Additionally, it favors the use of AllGather over AllReduce for data interaction, significantly lowering overhead (see Section 5.3 for the full comparisons). Drawing inspiration from the success of asynchronous communication in parallel computing [67], we further reuse the features from the previous step as context for the current step to overlap communication and computation, which we call displaced patch parallelism. This represents the first parallelism strategy tailored to the sequential characteristics of diffusion models while avoiding the heavy communication costs of traditional techniques like tensor parallelism.

Sparse computation. Sparse computation has been extensively researched in various domains, including weight [10, 16, 21, 31], input [53, 64, 65] and activation sparsity [7, 18, 23, 24, 42, 52, 58]. In the activation domain, to facilitate on-hardware speedups, several studies propose to use structured sparsity. SBNet [52] employs a spatial mask to sparsify activations for accelerating 3D object detection. This mask can be derived either from prior problem knowledge or an auxiliary network. In the context of image generation, SIGE [23] leverages the highly structured sparsity of user edits, selectively performing computation at the edited regions to speed up GANs [8] and diffusion models. MCUNetV2 [29] adopts patch-based inference to reduce memory usage for image classification and detection. In our work, we also partition the input into patches, each processed by a different device. However, we focus on reducing the latency of image generation via parallelism instead. Each device solely processes its assigned region to reduce the per-device computation.
3. Background

To generate a high-quality image, a diffusion model typically trains a noise-prediction neural network (e.g., a U-Net [55]) ϵ_θ. Starting from pure Gaussian noise x_T ∼ N(0, I), it takes tens to hundreds of iterative denoising steps to obtain the final clean image x_0, where T is the total number of steps. Specifically, given the noisy image x_t at time step t, the model ϵ_θ takes x_t, t and an additional condition c (e.g., text) as inputs to predict the corresponding noise ϵ_t within x_t. At each denoising step, x_{t−1} can be derived from the following equation:

x_{t−1} = Update(x_t, t, ϵ_t),  ϵ_t = ϵ_θ(x_t, t, c).  (1)

Here, 'Update' refers to a sampler-specific function that typically includes only element-wise additions and multiplications. Therefore, the primary source of latency in this process is the forward passes through the model ϵ_θ. For example, Stable Diffusion XL [46] requires 6,763 GMACs per step to generate a 1024 × 1024 image. This computational demand escalates more than quadratically with increasing resolution, making the latency for generating a single high-resolution image impractically high for real-world applications.
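To make Eq. (1) concrete, the following is a minimal sketch of the sampling loop; `model` and `update` are hypothetical placeholders for ϵ_θ and the sampler-specific update rule, not the paper's implementation.

```python
import torch

@torch.no_grad()
def sample(model, update, timesteps, cond, latent_shape, device="cuda"):
    """Generic diffusion sampling loop following Eq. (1).

    model(x_t, t, cond) -> predicted noise eps_t (the expensive forward pass).
    update(x_t, t, eps_t) -> x_{t-1} (cheap element-wise sampler arithmetic).
    """
    x = torch.randn(latent_shape, device=device)   # x_T ~ N(0, I)
    for t in timesteps:                            # e.g., 50 DDIM steps, from T down to 1
        eps = model(x, t, cond)                    # dominant cost: one U-Net forward
        x = update(x, t, eps)                      # element-wise additions/multiplications
    return x                                       # x_0, the clean image/latent
```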
Furthermore, given that x_{t−1} depends on x_t, parallel computation of ϵ_t and ϵ_{t−1} is challenging. Hence, even with multiple idle GPUs, accelerating the generation of a single high-resolution image remains tricky. Recently, Shih et al. introduced ParaDiGMS [59], employing Picard iterations to parallelize the denoising steps in a data-parallel manner. However, ParaDiGMS wastes computation on speculative guesses that fail quality thresholds. It also relies on a large total step count T to exploit multi-GPU data parallelism, limiting its potential applications. Another conventional method is sharding the model on multiple devices and using tensor parallelism for inference. However, this method suffers from intolerable communication costs, making it impractical for real-world applications. Beyond these two schemes, are there alternative strategies for distributing workloads across multiple GPU devices so that single-image generation can also enjoy the free-lunch speedups from multiple devices?

4. Method

The key idea of DistriFusion is to parallelize computation across devices by splitting the image into patches. Naïvely, this can be done by either (1) independently computing patches and stitching them together, or (2) synchronously communicating intermediate activations between patches. However, the first approach leads to visible discrepancies at the boundaries of each patch due to the absence of interaction between them (see Figure 1 and Figure 2(b)). The second approach, on the other hand, incurs excessive communication overheads, negating the benefits of parallel processing. To address these challenges, we propose a novel parallelism paradigm, displaced patch parallelism, which leverages the sequential nature of diffusion models to overlap communication and computation. Our key insight is reusing slightly outdated, or 'stale', activations from the previous diffusion step to facilitate interactions between patches, which we describe as activation displacement. This is based on the observation that the inputs for consecutive denoising steps are relatively similar. Consequently, computing each patch's activation at a layer does not rely on other patches' fresh activations, allowing communication to be hidden within subsequent layers' computation. We will next provide a detailed breakdown of each aspect of our algorithm and system design.

Displaced patch parallelism. As shown in Figure 3, when predicting ϵ_θ(x_t) (we omit the inputs of timestep t and condition c here for simplicity), we first split x_t into multiple patches x_t^(1), x_t^(2), ..., x_t^(N), where N is the number of devices. For example, we use N = 2 in Figure 3. Each device has a replica of the model ϵ_θ and processes a single patch independently, in parallel.

For a given layer l, consider the input activation patch on the i-th device, denoted as A_t^{l,(i)}. This patch is first scattered into the stale activation from the previous step, A_{t+1}^l, at its corresponding spatial location (the method for obtaining A_{t+1}^l will be discussed later). Here, A_{t+1}^l is in full spatial shape. In the Scatter output, only the 1/N region where A_t^{l,(i)} is placed is fresh and requires recomputation. We then selectively apply the layer operation F_l (linear, convolution, or attention) to these fresh areas, thereby generating the output for the corresponding regions. This process is repeated for each layer. Finally, the outputs from all layers are synchronized together to approximate ϵ_θ(x_t). Through this methodology, each device is responsible for only 1/N of the total computation, enabling efficient parallelization.
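The per-layer Scatter-then-sparse-compute step can be sketched as below. This is an illustrative single-layer approximation that assumes a height-wise split and a stride-1, 'same'-padded convolution; the function name and cache handling are our own assumptions, not the released implementation.

```python
import torch

def displaced_conv2d(conv, fresh_patch, stale_full, rank):
    """One displaced-patch convolution: paste the fresh patch into the cached
    full-shape activation from the previous step (Scatter), then run the conv
    only on the fresh rows plus a small halo.

    conv:        a torch.nn.Conv2d with stride 1 and 'same'-style padding
    fresh_patch: (B, C, H/N, W) activation patch owned by this device
    stale_full:  (B, C, H, W) cached activation A_{t+1}^l from the previous step
    """
    ph = fresh_patch.shape[2]
    top = rank * ph                                   # row offset of this device's patch
    full = stale_full.clone()
    full[:, :, top:top + ph] = fresh_patch            # Scatter: fresh rows overwrite stale ones
    halo = conv.kernel_size[0] // 2                   # extra rows needed for correct boundaries
    lo = max(top - halo, 0)
    hi = min(top + ph + halo, full.shape[2])
    out = conv(full[:, :, lo:hi])                     # compute only around the fresh region
    return out[:, :, top - lo:top - lo + ph]          # keep only this patch's output rows
```

The first denoising step has no stale cache yet; as described above, it is handled by a fully synchronous pass that also populates the cache used here.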
There still remains the problem of how to obtain the stale activations from the previous step. As shown in Figure 3, at each timestep t, when device i acquires A_t^{l,(i)}, it then broadcasts the activations to all other devices and performs the AllGather operation. Modern GPUs often support asynchronous communication and computation, which means that this AllGather process does not block ongoing computations. By the time we reach layer l in the next timestep, each device should have already received a replica of A_t^l. Such an approach effectively hides communication overheads within the computation phase, as shown in Figure 4. However, there is an exception: the very first step (i.e., x_T). In this scenario, each device simply executes the standard synchronous communication and caches the intermediate activations for the next step.
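A sketch of how the gather can be launched asynchronously and consumed one step later is shown below, assuming a height-wise split; the class and its buffer management are simplified assumptions built on the standard `torch.distributed` collectives, not the paper's system code.

```python
import torch
import torch.distributed as dist

class ActivationCache:
    """Displaced AllGather sketch: each device contributes its fresh patch and
    receives everyone's patches for use at the *next* step. The collective is
    launched with async_op=True so it overlaps with the following layers."""

    def __init__(self):
        self.bufs = None     # per-device buffers holding A_t^{l,(i)} for all i
        self.handle = None   # outstanding communication handle

    def push(self, fresh_patch):
        # Launch a non-blocking AllGather of this step's fresh patch.
        self.bufs = [torch.empty_like(fresh_patch) for _ in range(dist.get_world_size())]
        self.handle = dist.all_gather(self.bufs, fresh_patch.contiguous(), async_op=True)

    def pop_full(self):
        # Called when the same layer runs at the next timestep: by then the
        # gather has usually finished, so wait() is (nearly) free.
        self.handle.wait()
        return torch.cat(self.bufs, dim=2)  # reassemble the full-height activation
```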
[Figure 3 schematic: per-device pipeline of Scatter, sparse operators F_l, and asynchronous AllGather across layers 1, ..., L for two devices.]

Figure 3. Overview of DistriFusion. For simplicity, we omit the inputs of t and c, and use N = 2 devices as an example. Superscripts (1) and (2) represent the first and the second patch, respectively. Stale activations from the previous step are darkened. At each step t, we first split the input x_t into N patches x_t^(1), ..., x_t^(N). For each layer l and device i, upon getting the input activation patch A_t^{l,(i)}, two operations then proceed asynchronously: First, on device i, A_t^{l,(i)} is scattered back into the stale activation A_{t+1}^l from the previous step. The output of this Scatter operation is then fed into the sparse operator F_l (linear, convolution, or attention layers), which performs computation exclusively on the fresh regions and produces the corresponding output. Meanwhile, an AllGather operation is performed over A_t^{l,(i)} to prepare the full activation A_t^l for the next step. We repeat this procedure for each layer. The final outputs are then aggregated together to approximate ϵ_θ(x_t), which is used to compute x_{t−1}. The timeline visualization of each device for predicting ϵ_θ(x_t) is shown in Figure 4.

[Figure 4 timeline: on each device, Scatter and the sparse operators F_1, F_2, ..., F_L run on the computation stream while the AllGather calls run on the communication stream.]

Figure 4. Timeline visualization on each device when predicting ϵ_θ(x_t). Comm. means communication, which is asynchronous with computation. The AllGather overhead is fully hidden within the computation.

Sparse operations. For each layer l, we modify the original operator F_l to enable sparse computation selectively on the fresh areas. Specifically, if F_l is a convolution, linear, or cross-attention layer, we apply the operator exclusively to the newly refreshed regions, rather than the full feature map. This can be achieved by extracting the fresh sections from the Scatter output and feeding them into F_l. For layers where F_l is a self-attention layer, we transform it into a cross-attention layer, similar to SIGE [23]. In this setting, only the query tokens from the fresh areas are preserved on the device, while the key and value tokens still encompass the entire feature map (the Scatter output). Thus, the computational cost for F_l is exactly proportional to the size of the fresh area.
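The self-attention-to-cross-attention conversion can be sketched as follows, assuming the spatial map has already been flattened into tokens; the projection modules, index handling, and head arithmetic are simplified placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def displaced_self_attention(q_proj, k_proj, v_proj, scatter_out, fresh_idx, num_heads):
    """Sparse self-attention sketch: queries come only from the fresh tokens of
    this device's patch, while keys/values cover the whole Scatter output
    (fresh + stale tokens). scatter_out is (B, L, C) with the spatial map
    flattened to L tokens; fresh_idx indexes the fresh tokens."""
    b, l, c = scatter_out.shape
    q = q_proj(scatter_out[:, fresh_idx])          # (B, L_fresh, C): fresh queries only
    k = k_proj(scatter_out)                        # (B, L, C): keys over the full map
    v = v_proj(scatter_out)                        # (B, L, C): values over the full map

    def split(x):                                  # (B, T, C) -> (B, heads, T, C // heads)
        return x.view(b, -1, num_heads, c // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(b, -1, c)   # (B, L_fresh, C)
```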

Corrected asynchronous GroupNorm. Diffusion models often adopt group normalization (GN) [40, 68] layers in the network. These layers normalize across the spatial dimension, necessitating the aggregation of activations to restore their full spatial shape. In Section 5.3, we discover that either normalizing only the fresh patches or reusing stale features degrades image quality. However, aggregating all the normalization statistics would incur considerable overhead due to the synchronous communication. To solve this dilemma, we additionally introduce a correction term to the stale statistics. Specifically, for each device i at a given step t, every GN layer can compute the group-wise mean of its fresh patch A_t^(i), denoted as E[A_t^(i)]. For simplicity, we omit the layer index l here. It has also cached the local mean E[A_{t+1}^(i)] and the aggregated global mean E[A_{t+1}] from the previous step. Then the approximated global mean E[A_t] for the current step on device i can be computed as

E[A_t] ≈ E[A_{t+1}] + (E[A_t^(i)] − E[A_{t+1}^(i)]),  (2)

where E[A_{t+1}] is the stale global mean and the term in parentheses is the correction. We use the same technique to approximate E[A_t^2]; the variance can then be approximated as E[A_t^2] − E[A_t]^2. We then use these approximated statistics for the GN layer and in the meantime aggregate the local mean and variance to compute the precise ones using asynchronous communication. Thus, the communication cost can also be pipelined into the computation. We empirically find that this method yields results comparable to direct synchronous aggregation. However, there are some rare cases where the approximated variance is negative. For these negative-variance groups, we fall back to using the local variance of the fresh patch.
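A sketch of the corrected statistics in Eq. (2) is given below; the cross-device bookkeeping of the cached statistics is omitted, and the tensor shapes and names are illustrative assumptions.

```python
import torch

def corrected_group_stats(fresh, cached_local_mean, cached_global_mean,
                          cached_local_sqmean, cached_global_sqmean, num_groups):
    """Approximate this step's global GroupNorm statistics from the stale global
    statistics plus a correction computed on the local fresh patch (Eq. (2)).

    fresh:     (B, C, H_patch, W) fresh activation patch on this device
    cached_*:  (B, G) statistics saved at the previous step (their upkeep via
               asynchronous AllGather is omitted in this sketch)
    """
    b = fresh.shape[0]
    g = fresh.reshape(b, num_groups, -1)
    local_mean = g.mean(dim=-1)                       # E[A_t^(i)]
    local_sqmean = (g * g).mean(dim=-1)               # E[(A_t^(i))^2]

    mean = cached_global_mean + (local_mean - cached_local_mean)          # Eq. (2)
    sqmean = cached_global_sqmean + (local_sqmean - cached_local_sqmean)  # same trick
    var = sqmean - mean * mean

    # Rare negative-variance groups: fall back to the fresh patch's local variance.
    local_var = local_sqmean - local_mean * local_mean
    var = torch.where(var > 0, var, local_var)
    return mean, var
```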
Warm-up steps. As observed in eDiff-I [2] and FastComposer [69], the behavior of diffusion synthesis undergoes qualitative changes throughout the denoising process. Specifically, the initial sampling steps predominantly shape the low-frequency aspects of the image, such as spatial layout and overall semantics. As the sampling progresses, the focus shifts to recovering local high-frequency details. Therefore, to boost image quality, especially for samplers with a reduced number of steps, we adopt warm-up steps. Instead of directly employing displaced patch parallelism after the first step, we continue with several iterations of the standard synchronous patch parallelism as a preliminary phase, or warm-up. As detailed in Section 5.3, this integration of warm-up steps significantly improves performance.

5. Experiments

We first describe our experiment setups, including our benchmark datasets, baselines, and evaluation protocols. Then we present our main results regarding both quality and efficiency. Finally, we further show some ablation studies to verify each design choice.

5.1. Setups

Models. As our method only requires off-the-shelf pre-trained diffusion models, we mainly conduct experiments on the state-of-the-art public text-to-image model Stable Diffusion XL (SDXL) [46]. SDXL first compresses an image to an 8× smaller latent representation using a pre-trained auto-encoder and then applies a diffusion model in this latent space. It also incorporates multiple cross-attention layers to facilitate text conditioning. Compared to the original Stable Diffusion [54], SDXL adopts significantly more attention layers, resulting in a more computationally intensive model.

Datasets. We use the HuggingFace version of the COCO Captions 2014 [5] dataset to benchmark our method. This dataset contains human-generated captions for images from the Microsoft Common Objects in COntext (COCO) dataset [30]. For evaluation, we randomly sample a subset from the validation set, which contains 5K images with one caption per image.

Baselines. We compare our DistriFusion against the following baselines in terms of both quality and efficiency:
• Naïve Patch. At each iteration, the input is divided row-wise or column-wise alternately. These patches are then processed independently by the model, without any interaction between them. The outputs are subsequently concatenated together.
• ParaDiGMS [59] is a technique to accelerate pre-trained diffusion models by denoising multiple steps in parallel. It uses Picard iterations to guess the solution of future steps and iteratively refines it until convergence. We use a batch size of 8 for ParaDiGMS to align with Table 4 in the original paper [59]. We empirically find this setting yields the best performance in both quality and latency.

Metrics. Following previous works [22, 23, 37, 43], we evaluate the image quality with standard metrics: Peak Signal-to-Noise Ratio (PSNR, higher is better), LPIPS (lower is better) [76], and Fréchet Inception Distance (FID, lower is better) [11]†. We employ PSNR to quantify the minor numerical differences between the outputs of the benchmarked method and the original diffusion model outputs. LPIPS is used to evaluate perceptual similarity. Additionally, the FID score is used to measure the distributional differences between the outputs of the method and either the original outputs or the ground-truth images.

†We use TorchMetrics to calculate PSNR and LPIPS, and use Clean-FID [44] to calculate FID.

Implementation details. By default, we adopt the 50-step DDIM sampler [61] with a classifier-free guidance scale of 5 to generate 1024 × 1024 images, unless otherwise specified. In addition to the first step, we perform another 4 steps of synchronous patch parallelism, serving as a warm-up phase. We use PyTorch 2.2 [45] to benchmark the speedups of our method. To measure latency, we first warm up the devices with 3 iterations of the whole denoising process, then run another 10 iterations and calculate the average latency after discarding the results of the fastest and slowest runs. Additionally, we use CUDA Graphs to optimize some kernel launching overhead for both the original model and our method.
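The latency protocol above can be reproduced with a sketch like the following (CUDA events, 3 warm-up runs, 10 timed runs with the fastest and slowest discarded); `run_pipeline` is a placeholder for one full denoising run, and the CUDA Graph capture is omitted.

```python
import torch

def benchmark_latency(run_pipeline, warmup=3, iters=10):
    """Average end-to-end latency of `run_pipeline`, excluding the fastest and
    slowest of `iters` timed runs, after `warmup` untimed warm-up runs."""
    for _ in range(warmup):
        run_pipeline()
    torch.cuda.synchronize()

    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        run_pipeline()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end) / 1e3)  # milliseconds -> seconds

    times = sorted(times)[1:-1]                      # drop fastest and slowest
    return sum(times) / len(times)
```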
5.2. Main Results

Quality results. In Figure 5, we show some qualitative visual results and report quantitative evaluations in Table 1. w/ G.T. means computing the metric with the ground-truth COCO [30] images, whereas w/ Orig. refers to computing the metrics with the outputs from the original model. For PSNR, we report only the w/ Orig. setting, as the w/ G.T. comparison is not informative due to significant numerical differences between the generated outputs and the ground-truth images.

As shown in Table 1, ParaDiGMS [59] expends considerable computational resources on guessing future denoising steps, resulting in much higher total MACs. Besides, it also suffers from some performance degradation. In contrast, our method simply distributes workloads across multiple GPUs, maintaining a constant total computation. The Naïve Patch baseline, while lower in total MACs, lacks the crucial inter-patch interaction, leading to fragmented outputs. This limitation significantly impacts image quality, as reflected across all evaluation metrics. Our DistriFusion can well preserve the interaction. Even when using 8 devices, it achieves PSNR, LPIPS, and FID scores comparable to those of the original model.

[Figure 5 panels (latency and FID vs. ground truth): Original: 5.02s, FID 24.0. Naïve Patch (2 devices): 2.83s (1.8× faster), FID 33.6. ParaDiGMS (8 devices): 1.80s (2.8× faster), FID 25.1. Ours (2 devices): 3.35s (1.5× faster), FID 24.0. Ours (4 devices): 2.26s (2.2× faster), FID 24.2. Ours (8 devices): 1.77s (2.8× faster), FID 24.3. Prompts: "A multi-colored parrot holding its foot up to its beak."; "A kid wearing headphones and using a laptop"; "A pair of parking meters reflecting expired times."; "A double decker bus driving down the street."; "A brown dog laying on the ground with a metal bowl in front of him."]

Figure 5. Qualitative results. FID is computed against the ground-truth images. Our DistriFusion can reduce the latency according to the number of used devices while preserving visual fidelity.

Speedups. Compared to the theoretical computation reduction, on-hardware acceleration is more critical for real-world applications. To demonstrate the effectiveness of our method, we also report the end-to-end latency in Table 1 on 8 NVIDIA A100 GPUs. In the 50-step setting, ParaDiGMS achieves a 2.8× speedup, identical to our method, but at the cost of compromised image quality (see Figure 5). In the more commonly used 25-step setting, ParaDiGMS only obtains a marginal 1.3× speedup due to excessive wasted guesses, which is also reported in Shih et al. [59]. However, our method can still mirror the original quality and accelerate the model by 2.7×.
| #Steps | #Devices | Method | PSNR (↑) w/ Orig. | LPIPS (↓) w/ G.T. | LPIPS (↓) w/ Orig. | FID (↓) w/ G.T. | FID (↓) w/ Orig. | MACs (T) | Latency (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 1 | Original | – | 0.797 | – | 24.0 | – | 338 | 5.02 | – |
| 50 | 2 | Naïve Patch | 28.2 | 0.812 | 0.596 | 33.6 | 29.4 | 322 | 2.83 | 1.8× |
| 50 | 2 | Ours | 31.9 | 0.797 | 0.146 | 24.2 | 4.86 | 338 | 3.35 | 1.5× |
| 50 | 4 | Naïve Patch | 27.9 | 0.853 | 0.753 | 125 | 133 | 318 | 1.74 | 2.9× |
| 50 | 4 | Ours | 31.0 | 0.798 | 0.183 | 24.2 | 5.76 | 338 | 2.26 | 2.2× |
| 50 | 8 | Naïve Patch | 27.8 | 0.892 | 0.857 | 252 | 259 | 324 | 1.27 | 4.0× |
| 50 | 8 | ParaDiGMS | 29.3 | 0.800 | 0.320 | 25.1 | 10.8 | 657 | 1.80 | 2.8× |
| 50 | 8 | Ours | 30.5 | 0.799 | 0.211 | 24.4 | 6.46 | 338 | 1.77 | 2.8× |
| 25 | 1 | Original | – | 0.801 | – | 23.9 | – | 169 | 2.52 | – |
| 25 | 8 | ParaDiGMS | 29.6 | 0.808 | 0.273 | 25.8 | 10.4 | 721 | 1.89 | 1.3× |
| 25 | 8 | Ours | 31.5 | 0.802 | 0.161 | 24.6 | 5.67 | 169 | 0.93 | 2.7× |

Table 1. Quantitative evaluation. MACs measures cumulative computation across all devices for the whole denoising process for generating a single 1024 × 1024 image. w/ G.T. means calculating the metrics with the ground-truth images, while w/ Orig. means with the original model's samples. For PSNR, we report the w/ Orig. setting. Our method mirrors the results of the original model across all metrics while maintaining the total MACs. It also reduces the latency on NVIDIA A100 GPUs in proportion to the number of used devices.

When generating 1024 × 1024 images, our speedups are limited by the low GPU utilization of SDXL. To maximize device usage, we further scale the resolution to 2048 × 2048 and 3840 × 3840 in Figure 6. At these larger resolutions, the GPU devices are better utilized. Specifically, for 3840 × 3840 images, DistriFusion reduces the latency by 1.8×, 3.4× and 6.1× with 2, 4 and 8 A100s, respectively. Note that these results are benchmarked with PyTorch. With more advanced compilers, such as TVM [4] and TensorRT [1], we anticipate even higher GPU utilization and consequently more pronounced speedups from DistriFusion, as observed in SIGE [23]. In practical use, the batch size often doubles due to classifier-free guidance [12]. We can first split the batch and then apply DistriFusion to each batch separately. This approach further improves the total speedups to 3.6× and 6.6× with 4 and 8 A100s for generating a single 3840 × 3840 image, respectively.
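Conceptually, this batch splitting for classifier-free guidance can be sketched as below: half of the devices evaluate the conditional branch and half the unconditional one, with each half running DistriFusion internally. `exchange_with_peer` is a hypothetical helper for the two-way exchange of noise predictions (e.g., a small all-gather between matched devices), not part of the released code.

```python
def guided_eps(distrifusion_unet, exchange_with_peer, x, t, cond, uncond, scale, on_cond_half):
    """Classifier-free guidance with the batch split across two device groups.

    distrifusion_unet(x, t, c) -> noise prediction, computed patch-parallel
    within this device's half; exchange_with_peer swaps predictions with the
    matching device of the other half."""
    eps_mine = distrifusion_unet(x, t, cond if on_cond_half else uncond)
    eps_other = exchange_with_peer(eps_mine)
    eps_cond, eps_uncond = (eps_mine, eps_other) if on_cond_half else (eps_other, eps_mine)
    return eps_uncond + scale * (eps_cond - eps_uncond)   # standard guidance combine
```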
[Figure 6 bar chart data (latency and speedup vs. Original): 1024 × 1024 — Original 5.02s; Ours 3.35s (1.5×, 2 devices), 2.26s (2.2×, 4 devices), 1.77s (2.8×, 8 devices). 2048 × 2048 — Original 23.7s; Ours 13.3s (1.8×), 7.60s (3.1×), 4.81s (4.9×). 3840 × 3840 — Original 140s; Ours 76.6s (1.8×), 41.3s (3.4×), 22.9s (6.1×).]

Figure 6. Measured total latency of DistriFusion with the 50-step DDIM sampler [61] for generating a single image across different resolutions on NVIDIA A100 GPUs. When scaling up the resolution, the GPU devices are better utilized. Remarkably, when generating 3840 × 3840 images, DistriFusion achieves 1.8×, 3.4× and 6.1× speedups with 2, 4, and 8 A100s, respectively.

| Method | 1024 × 1024 Comm. | 1024 × 1024 Latency | 2048 × 2048 Comm. | 2048 × 2048 Latency | 3840 × 3840 Comm. | 3840 × 3840 Latency |
|---|---|---|---|---|---|---|
| Original | – | 5.02s | – | 23.7s | – | 140s |
| Sync. TP | 1.33G | 3.61s | 5.33G | 11.7s | 18.7G | 46.3s |
| Sync. PP | 0.42G | 2.21s | 1.48G | 5.62s | 5.38G | 24.7s |
| DistriFusion (Ours) | 0.42G | 1.77s | 1.48G | 4.81s | 5.38G | 22.9s |
| No Comm. | – | 1.48s | – | 4.14s | – | 21.3s |

Table 2. Communication cost comparisons with 8 A100s across different resolutions. Sync. TP/PP: synchronous tensor/patch parallelism. No Comm.: an ideal PP without any communication. Comm. measures the total communication amount. PP requires less than 1/3 of the communication amount of TP. Our DistriFusion further reduces the communication overhead by 50 ∼ 60%.

5.3. Ablation Study

Compare to tensor parallelism. In Table 2, we benchmark our latency against synchronous tensor parallelism (Sync. TP) and synchronous patch parallelism (Sync. PP), and report the corresponding communication amounts. Compared to TP, PP has better independence, which eliminates the need for communication within cross-attention and linear layers. For convolutional layers, communication is only required at the patch boundaries, which represent a minimal portion of the entire tensor. Moreover, PP utilizes AllGather over AllReduce, leading to lower communication demands and no additional use of computing resources. Therefore, PP requires 60% less communication and is 1.6 ∼ 2.1× faster than TP, making it a more efficient approach for deploying diffusion models. We also include a theoretical PP baseline without any communication (No Comm.) to demonstrate the communication overhead in Sync. PP and DistriFusion. Compared to Sync. PP, DistriFusion further cuts such overhead by over 50%. The remaining overhead mainly comes from our current usage of the NVIDIA Collective Communication Library (NCCL) for asynchronous communication. NCCL kernels use SMs (the computing resources on GPUs), which slows down the overlapped computation. Using remote memory access could bypass this issue and close the performance gap.

Input similarity. Our displaced patch parallelism relies on the assumption that the inputs from consecutive denoising steps are similar. To support this claim, we quantitatively calculate the model input difference across all consecutive steps using a 50-step DDIM sampler. The average difference is only 0.02, within the input range of [−4, 4] (about 0.3%). Figure 7 further qualitatively visualizes the input difference between steps 9 and 8 (randomly selected). The difference is nearly all zero, substantiating our hypothesis of high similarity between inputs from neighboring steps.
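This measurement can be sketched as follows, reusing the placeholder `model` and `update` from the earlier sampling sketch; it simply tracks the mean absolute difference between the model inputs of consecutive steps.

```python
import torch

@torch.no_grad()
def mean_step_difference(model, update, timesteps, cond, latent_shape, device="cuda"):
    """Average mean-absolute difference between consecutive denoising-step inputs."""
    x = torch.randn(latent_shape, device=device)
    diffs, prev = [], None
    for t in timesteps:
        if prev is not None:
            diffs.append((x - prev).abs().mean().item())  # difference between consecutive inputs
        prev = x.clone()
        x = update(x, t, model(x, t, cond))
    return sum(diffs) / len(diffs)
```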
[Figure 7 panels: step 9 input x_9; step 8 input x_8; input difference |x_9 − x_8|. Prompt: "A kitchen with a microwave, stove, cutlery and fruits."]

Figure 7. Visualization of the inputs from steps 9 and 8 and their difference. All feature maps are channel-wise averaged. The difference is nearly all zero, exhibiting high similarity.
Few-step sampling and warm-up steps. As stated above, our approach hinges on the observation that adjacent denoising steps share similar inputs, i.e., x_t ≈ x_{t−1}. However, as we increase the step size and thereby reduce the number of steps, the approximation error escalates, potentially compromising the effectiveness of our method. In Figure 8, we present results using the 10-step DPM-Solver [32, 33]. The 10-step configuration is the threshold for training-free samplers to maintain image quality. Under this setting, naïve DistriFusion without warm-up struggles to preserve the image quality. However, incorporating an additional two-step warm-up significantly recovers the performance with only slightly increased latency.
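A sketch of how such a warm-up schedule could be wired into the sampling loop is shown below; `set_mode` is a hypothetical switch between synchronous and displaced patch parallelism, not the actual API of the released framework.

```python
def run_denoising(distri_unet, update, timesteps, cond, x, warmup_steps=2):
    """Warm-up schedule sketch: the first step and a few additional warm-up steps
    use synchronous patch parallelism (full communication); afterwards the model
    switches to displaced patch parallelism that reuses stale activations."""
    for i, t in enumerate(timesteps):
        mode = "sync_patch" if i <= warmup_steps else "displaced_async"
        distri_unet.set_mode(mode)            # hypothetical: choose communication scheme
        x = update(x, t, distri_unet(x, t, cond))
    return x
```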

[Figure 8 panels (8 A100s, 10-step DPM-Solver): Original: latency 1.01s. Ours w/o warm-up: LPIPS 0.404, latency 0.374s. Ours 1-step warm-up: LPIPS 0.288, latency 0.388s. Ours 2-step warm-up: LPIPS 0.196, latency 0.400s. Prompts: "A small boat in the blue and green water."; "A motorcylce sits on the pavement on a cloudy day."]

Figure 8. Qualitative results on the 10-step DPM-Solver [32, 33] with different warm-up steps. LPIPS is computed against the samples from the original SDXL over the entire COCO [5] dataset. Naïve DistriFusion without warm-up steps has evident quality degradation. Adding a 2-step warm-up significantly improves the performance while avoiding a high latency rise.
GroupNorm. As discussed in Section 4, calculating accurate group normalization (GN) statistics is crucial for preserving image quality. In Figure 9, we compare four different GN schemes. The first approach, Separate GN, uses statistics from the on-device fresh patch. This approach delivers the best speed at the cost of lower image fidelity. This compromise is particularly severe for large numbers of used devices, due to the insufficient patch size for precise statistics estimation. The second scheme, Stale GN, computes statistics using stale activations. However, this method also faces quality degradation because of the different distributions between stale and fresh activations, often resulting in images with a fog-like noise effect. The third approach, Sync. GN, uses synchronized communication to aggregate accurate statistics. Though achieving the best image quality, it suffers from a large synchronization overhead. Our method uses a correction term to close the distribution gap between the stale and fresh statistics. It achieves image quality on par with Sync. GN but without incurring the synchronous communication overhead.
[Figure 9 panels (8 A100s): Original: latency 5.02s. Separate GN: LPIPS 0.317, latency 1.64s. Stale GN: LPIPS 0.247, latency 1.76s. Sync. GN: LPIPS 0.207, latency 1.85s. Ours: LPIPS 0.211, latency 1.77s. Prompt: "An old clock reading two twenty on a gloomy day."]

Figure 9. Qualitative results of different GN schemes with 8 A100s. LPIPS is computed against the original samples over the whole COCO [5] dataset. Separate GN only utilizes the statistics from the on-device patch. Stale GN reuses the stale statistics. They suffer from quality degradation. Sync. GN synchronizes data to ensure accurate statistics at the cost of extra overhead. Our corrected asynchronous GN, by correcting stale statistics, avoids the need for synchronization and effectively restores quality.

6. Conclusion & Discussion

In this paper, we introduce DistriFusion to accelerate diffusion models with multiple GPUs for parallelism. Our method divides images into patches, assigning each to a separate GPU. We reuse the pre-computed activations from previous steps to maintain patch interactions. On Stable Diffusion XL, our method achieves up to a 6.1× speedup on 8 NVIDIA A100s. This advancement not only enhances the efficiency of AI-generated content creation but also sets a new benchmark for future research in parallel computing for AI applications.

Limitations. To fully hide the communication overhead within the computation, NVLink is essential for DistriFusion to maximize the speedup. However, NVLink is already widely deployed in recent GPU systems. Moreover, quantization [25] can also reduce the communication workload of our method. Besides, DistriFusion has limited speedups for low-resolution images, as the devices are underutilized. Advanced compilers [1, 4] would help to better exploit the devices and achieve higher speedups. Our method may not work for extremely-few-step methods [34–36, 57, 63], due to the rapid changes of the denoising states. Yet our preliminary experiments suggest that slightly more steps (e.g., 10) are enough for DistriFusion to obtain high-quality results.
Acknowledgments

We thank Jun-Yan Zhu and Ligeng Zhu for their helpful discussion and valuable feedback. The project is supported by MIT-IBM Watson AI Lab, Amazon, MIT Science Hub, and the National Science Foundation.

References

[1] NVIDIA/TensorRT. 2023.
[2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
[4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
[5] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[6] Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, et al. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
[7] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In CVPR, 2017.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.
[9] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, and Navdeep Jaitly. Matryoshka diffusion models. arXiv preprint arXiv:2310.15111, 2023.
[10] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. NeurIPS, 2015.
[11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017.
[12] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
[14] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
[15] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. NeurIPS, 2019.
[16] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[17] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. MLSys, 2019.
[18] Patrick Judd, Alberto Delmas, Sayeh Sharify, and Andreas Moshovos. Cnvlutin2: Ineffectual-activation-and-weight-free deep neural network computing. arXiv preprint arXiv:1705.00125, 2017.
[19] Gwanghyun Kim and Jong Chul Ye. DiffusionCLIP: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2021.
[20] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
[21] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2016.
[22] Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, and Song Han. GAN compression: Efficient architectures for interactive conditional GANs. In CVPR, 2020.
[23] Muyang Li, Ji Lin, Chenlin Meng, Stefano Ermon, Song Han, and Jun-Yan Zhu. Efficient spatially sparse inference for conditional GANs and diffusion models. In NeurIPS, 2022.
[24] Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR, 2017.
[25] Xiuyu Li, Long Lian, Yijiang Liu, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-Diffusion: Quantizing diffusion models. arXiv preprint arXiv:2302.04304, 2023.
[26] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. NeurIPS, 2023.
[27] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, D. Song, and I. Stoica. TeraPipe: Token-level pipeline parallelism for training large-scale language models. ICML, 2021.
[28] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Z. Chen, Hao Zhang, Joseph E. Gonzalez, and I. Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. USENIX Symposium on Operating Systems Design and Implementation, 2023.
[29] Ji Lin, Wei-Ming Chen, Han Cai, Chuang Gan, and Song Han. MCUNetV2: Memory-efficient patch-based inference for tiny deep learning. In Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
[30] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[31] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In CVPR, 2015.
[32] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
[33] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
[34] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
[35] Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. LCM-LoRA: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556, 2023.
[36] Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022.
[37] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
[38] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In SOSP, 2019.
[39] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, and M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.
[40] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
[41] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[42] Bowen Pan, Wuwei Lin, Xiaolin Fang, Chaoqin Huang, Bolei Zhou, and Cewu Lu. Recurrent residual module for fast inference in videos. In CVPR, 2018.
[43] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, 2019.
[44] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11400–11410. IEEE, 2022.
[45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[46] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
[47] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2019.
[48] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
[49] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[50] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
[51] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference, USENIX ATC 2021, July 14-16, 2021, pages 551–564. USENIX Association, 2021.
[52] Mengye Ren, Andrei Pokrovsky, Bin Yang, and Raquel Urtasun. SBNet: Sparse blocks network for fast inference. In CVPR, 2018.
[53] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In CVPR, 2017.
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[55] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
[57] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2021.
[58] Shaohuai Shi and Xiaowen Chu. Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. arXiv preprint arXiv:1704.07724, 2017.
[59] Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, and Nima Anari. Parallel sampling of diffusion models. NeurIPS, 2023.
[60] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
[61] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2020.
[62] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2020.
[63] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.
[64] Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, and Song Han. TorchSparse: Efficient point cloud inference engine. In MLSys, 2022.
[65] Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. TorchSparse++: Efficient training and inference framework for sparse convolution on GPUs. In MICRO, 2023.
[66] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. 34:11287–11302, 2021.
[67] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990.
[68] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[69] Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, and Song Han. FastComposer: Tuning-free multi-subject image generation with localized attention. arXiv, 2023.
[70] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In ICLR, 2022.
[71] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, and Zhifeng Chen. GSPMD: General and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663, 2021.
[72] Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, and Jie Zhao. OneFlow: Redesign the distributed deep learning framework from scratch. arXiv preprint arXiv:2110.15032, 2021.
[73] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In ICLR, 2022.
[74] Qinsheng Zhang, Molei Tao, and Yongxin Chen. gDDIM: Generalized denoising diffusion implicit models. 2022.
[75] Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, and Ming-Yu Liu. DiffCollage: Parallel generation of large content with diffusion models. In CVPR, 2023.
[76] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[77] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
[78] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559–578, 2022.
