BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

Yang Sui1,2,†, Yanyu Li1, Anil Kag1, Yerlan Idelbayev1, Junli Cao1, Ju Hu1,
Dhritiman Sagar1, Bo Yuan2, Sergey Tulyakov1, Jian Ren1,∗
1Snap Inc.  2Rutgers University
Project Page: https://snap-research.github.io/BitsFusion
Abstract

Diffusion-based image generation models have achieved great success in recent years by showing the capability of synthesizing high-quality content. However, these models contain a huge number of parameters, resulting in a significantly large model size. Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this work, we develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model with 7.9× smaller size while exhibiting even better generation quality than the original one. Our approach includes several novel techniques, such as assigning optimal bits to each layer, initializing the quantized model for better performance, and improving the training strategy to dramatically reduce quantization error. Furthermore, we extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality.

† Work done during an internship at Snap Inc. ∗ Corresponding author.
Figure 1: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion, where the weights of the UNet are quantized to 1.99 bits, achieving 7.9× smaller storage than the one from Stable Diffusion v1.5. All images are synthesized using the PNDM sampler [45] with 50 sampling steps and a random seed of 1024. Prompts and more generations are provided in App. N.

1 Introduction

Recent efforts in developing diffusion-based image generation models [69, 31, 71, 21, 72] have demonstrated remarkable results in synthesizing high-fidelity and photo-realistic images, leading to various applications such as content creation and editing [60, 59, 53, 63, 81, 79, 46, 39], video generation [20, 67, 3, 1, 16, 52, 15], and 3D asset synthesis [78, 42, 56, 66, 57], among others. However, Diffusion Models (DMs) come with the drawback of a large number of parameters, e.g., millions or even billions, causing significant burdens for transferring and storing due to the bulky model size, especially on resource-constrained hardware such as mobile and wearable devices.

Existing studies have explored reducing the model size of large-scale text-to-image diffusion models by designing efficient architectures and network pruning [40, 82, 32]. These approaches usually require significant amounts of training due to the changes made to the pre-trained networks. Another promising direction for model storage reduction is quantization [12, 30], where floating-point weights are converted to low-bit fixed-point representations, thereby saving computation memory and storage.

There have been emerging efforts on compressing DMs through quantization [65, 37, 17, 38]. However, these approaches still face several major challenges, especially when quantizing large-scale text-to-image diffusion models like Stable Diffusion v1.5 (SD-v1.5) [62]. First, many of these methods are developed on relatively small-scale DMs trained on constrained datasets. For example, models trained on CIFAR-10 require modest storage of around 100 MB [21, 38]. In contrast, SD-v1.5 requires 3.44 GB of storage in a full-precision format. Adapting these methods to SD-v1.5 remains a challenging problem. Second, existing methods mainly focus on quantizing weights to 4 bits. How to quantize the model to extremely low bit widths is not well studied. Third, there is a lack of fair and extensive evaluation of how quantization methods perform on large-scale DMs, i.e., SD-v1.5.

To tackle the above challenges, this work proposes BitsFusion, a quantization-aware training framework that employs a series of novel techniques to compress the weights of pre-trained large-scale DMs into extremely low bits (i.e., 1.99 bits), achieving even better performance (i.e., higher image quality and better text-image alignment). Consequently, we compress the 1.72 GB UNet (FP16) of SD-v1.5 into a 219 MB model, achieving a 7.9× compression ratio. (For SD-v1.5, we measure the generation quality using the FP32 format. However, since SD-v1.5 FP16 has similar performance to SD-v1.5 FP32, we use SD-v1.5 FP16 to calculate our compression ratio.) Specifically, our contributions can be summarized into the following four dimensions:

  • Mixed-Precision Quantization for DMs. We propose an effective approach for quantizing DMs in a mixed-precision manner. First, we thoroughly analyze the appropriate metrics to understand the quantization error in the quantized DMs (Sec. 3.2). Second, based on the analysis, we quantize different layers into different bits according to their quantization error (Sec. 3.3).

  • Initialization for Quantized DMs. We introduce several techniques to initialize the quantized model to improve performance, including time embedding pre-computing and caching, adding balance integer, and alternating optimization for scaling factor initialization (Sec. 4.1).

  • Improved Training Pipeline for Quantized DMs. We improve the training pipeline for the quantized model with the proposed two-stage training approach (Sec. 4.2). In the first stage, we use the full-precision model as a teacher to train the quantized model through distillation. Our distillation loss forces the quantized model to learn both the predicted noise and the intermediate features from the teacher network. Furthermore, we adjust the distribution of time step sampling during training, such that the time steps causing larger quantization errors are sampled more frequently. In the second stage, we fine-tune the model using vanilla noise prediction [21].

  • Extensive Quantitative Evaluation. For the first time in the literature, we conduct extensive quantitative analysis to compare the performance of the quantized model against the original SD-v1.5. We include results on various benchmark datasets, i.e., TIFA [25], GenEval [13], CLIP score [58] and FID [19] on MS-COCO 2014 validation set [43]. Additionally, we perform human evaluation on PartiPrompts [77]. Our 1.99-bit weight-quantized model consistently outperforms the full-precision model across all the evaluations, demonstrating the effectiveness of our approach.

2 Related Works

To enhance model efficiency in terms of storage and computational costs, quantization is adopted for diffusion models [65, 37, 18, 68, 73, 75, 47, 76, 4, 74, 7, 83, 27, 17, 38] with primarily two types: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ does not require a full training loop; instead, it utilizes a limited calibration dataset to adjust the quantization parameters. For example, PTQ4DM [65] calibrates the quantization parameters to minimize the quantization error of DMs. Q-Diffusion [37] minimizes the quantization error via the block-wise reconstruction [41]. PTQD [18] integrates quantization noise into the stochastic noise inherent in the sampling steps of DMs. TDQ [68] optimizes scaling factors for activations across different time steps, applicable to both PTQ and QAT strategies. TFMQ [27] focuses on reconstructing time embedding and projection layers to prevent over-fitting. However, PTQ often results in performance degradation compared to QAT, particularly when aiming for extremely low-bit DMs. In contrast, QAT involves training the full weights to minimize the quantization error, thereby achieving higher performance compared to PTQ. For instance, EfficientDM [17], inspired by LoRA [24], introduces a quantization-aware low-rank adapter to update the LoRA weights, avoiding training entire weights. Q-DM [38] employs normalization and smoothing operation on attention features through proposed Q-attention blocks, enhancing quantization performance. Nevertheless, existing works primarily study quantization at 4 bits and above on small-scale DMs trained on constrained datasets. In this paper, we focus on quantizing large-scale Stable Diffusion to extremely low bits and extensively evaluating the performance across different benchmark datasets.

3 Mixed Precision Quantization for Diffusion Models

In this section, we first go through the formulations of weight quantization and generative diffusion models. We then determine the mixed-precision strategy, assigning optimized bit widths to different layers to reduce the overall quantization error. Specifically, we first analyze the quantization error of each layer in the diffusion model and derive its sensitivity properties. Then, based on the analysis, we assign appropriate bits to each layer while also taking parameter efficiency (i.e., size savings) into account.

3.1 Preliminaries

Quantization is a popular and commonly used technique to reduce model size. While many quantization forms exist, we focus on uniform quantization, where full-precision values are mapped into discrete integer values as follows:

$$\bm{\theta}_{\texttt{int}} = \texttt{Clip}\left(\left\lfloor \frac{\bm{\theta}_{\texttt{fp}}}{\mathbf{s}} \right\rceil + I_z,\ 0,\ 2^{b}-1\right), \tag{1}$$

where $\bm{\theta}_{\texttt{fp}}$ denotes the floating-point weights, $\bm{\theta}_{\texttt{int}}$ is the quantized integer weights, $\mathbf{s}$ is the scaling factor, $I_z$ is the zero point, and $b$ is the quantization bit-width. $\lfloor\cdot\rceil$ denotes the nearest rounding operation and $\texttt{Clip}(\cdot)$ denotes the clipping operation that constrains $\bm{\theta}_{\texttt{int}}$ within the target range. Following the common settings [37, 17], we apply the channel-wise quantization and set 8 bits for the first and last convolutional layer of the UNet.
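To make Eq. (1) concrete, the following is a minimal NumPy sketch of channel-wise uniform quantization and the corresponding de-quantization $\mathbf{s}\cdot(\bm{\theta}_{\texttt{int}} - I_z)$. The toy weights, the choice $I_z = 2^{b-1}$, and the Min-Max style scale are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def quantize(theta_fp, s, zero_point, bits):
    """Eq. (1): map float weights to integers in [0, 2^bits - 1]."""
    # np.rint rounds to the nearest integer (ties go to the even integer).
    q = np.rint(theta_fp / s) + zero_point
    return np.clip(q, 0, 2 ** bits - 1).astype(np.int64)

def dequantize(theta_int, s, zero_point):
    """Approximate reconstruction s * (theta_int - I_z)."""
    return s * (theta_int - zero_point)

# Toy weight matrix: 2 output channels with 4 weights each (illustrative values).
theta_fp = np.array([[-0.9, -0.2, 0.1, 0.6],
                     [ 0.3, -0.8, 0.0, 0.5]])
bits = 2
zero_point = 2 ** (bits - 1)   # assumed I_z = 2, so integers 0..3 stand for -2..1
# One scaling factor per output channel (channel-wise quantization).
s = np.abs(theta_fp).max(axis=1, keepdims=True) / 2 ** (bits - 1)

theta_int = quantize(theta_fp, s, zero_point, bits)
theta_hat = dequantize(theta_int, s, zero_point)
```

Each row uses its own scale, so an outlier in one channel does not coarsen the grid of the others.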

Stable Diffusion. Denoising diffusion probabilistic models [69, 21] learn to predict the real data distribution $\mathbf{x} \sim p_{\text{data}}$ by reversing the ODE flow. Specifically, given a noisy data sample $\mathbf{z}_t = \alpha_t \mathbf{x} + \sigma_t \bm{\epsilon}$ ($\alpha_t$ and $\sigma_t$ are SNR schedules and $\bm{\epsilon}$ is the added ground-truth noise), and a quantized denoising model $\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}$ parameterized by $\bm{\theta}_{\texttt{int}}$ and $\mathbf{s}$, the learning objective can be formulated as follows,

$$\mathcal{L}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}} = \mathbb{E}_{t, \mathbf{x}}\left[\left\| \bm{\epsilon} - \hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}(t, \mathbf{z}_t, \mathbf{c}) \right\|\right], \tag{2}$$

where $t$ is the sampled time step and $\mathbf{c}$ is the input condition (e.g., text embedding). Note that during the training of the quantized model, we optimize $\bm{\theta}_{\texttt{fp}}$ and $\mathbf{s}$ by backpropagating $\mathcal{L}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}$ via the Straight-Through Estimator (STE) [2] and quantize the weights to integers for deployment. Here, for notational simplicity, we directly use $\bm{\theta}_{\texttt{int}}$ to represent the optimized weights in the quantized models.
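As a rough illustration of how the Straight-Through Estimator lets gradients reach the latent full-precision weights, the sketch below trains a toy linear model in NumPy with a hand-written gradient. The regression task, the fixed scale `s`, the learning rate, and the helper `fake_quant` are all hypothetical. Unlike the setting described above, only the weights are updated here, not the scaling factor.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quant(w_fp, s, bits=2):
    """Quantize then de-quantize (zero point assumed to be 2^(bits-1))."""
    z = 2 ** (bits - 1)
    w_int = np.clip(np.rint(w_fp / s) + z, 0, 2 ** bits - 1)
    return s * (w_int - z)

# Toy regression whose target weights are exactly representable with s = 0.5.
X = rng.normal(size=(256, 4))
w_target = np.array([-1.0, -0.5, 0.0, 0.5])
y = X @ w_target

w_fp = rng.normal(scale=0.1, size=4)   # latent full-precision weights
s, lr = 0.5, 0.05

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

loss_before = loss(fake_quant(w_fp, s))
for _ in range(200):
    w_hat = fake_quant(w_fp, s)                      # forward pass uses quantized weights
    grad_w_hat = 2 * X.T @ (X @ w_hat - y) / len(X)  # gradient w.r.t. the quantized weights
    # STE: treat d(w_hat)/d(w_fp) as the identity, so the same gradient
    # is applied to the latent full-precision weights.
    w_fp -= lr * grad_w_hat
loss_after = loss(fake_quant(w_fp, s))
```

The rounding step has zero gradient almost everywhere, which is why the identity surrogate is needed for the latent weights to move at all.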

The latent diffusion model [62] such as Stable Diffusion conducts the denoising process in the latent space encoded by variational autoencoder (VAE) [34, 61], where the diffusion model is the UNet [9]. This work mainly studies the quantization for the UNet model, given it is the major bottleneck for the storage and runtime of the Stable Diffusion [40]. During the inference time, classifier-free guidance (CFG) [22] is usually applied to improve the generation,

$$\tilde{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}(t, \mathbf{z}_t, \mathbf{c}) = w\,\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}(t, \mathbf{z}_t, \mathbf{c}) - (w-1)\,\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}(t, \mathbf{z}_t, \varnothing), \tag{3}$$

where $w \geq 1$ and $\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}}, \mathbf{s}}(t, \mathbf{z}_t, \varnothing)$ denotes the generation conditioned on the null text prompt $\varnothing$.
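The guidance rule in Eq. (3) is a simple linear combination of two noise predictions. A minimal NumPy sketch follows, in which random arrays stand in for the UNet outputs and the guidance weight 7.5 is only an illustrative value.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Eq. (3): w * eps(c) - (w - 1) * eps(null prompt)."""
    return w * eps_cond - (w - 1.0) * eps_uncond

rng = np.random.default_rng(0)
eps_cond = rng.normal(size=(4, 8, 8))    # stand-in for the prediction given the text condition c
eps_uncond = rng.normal(size=(4, 8, 8))  # stand-in for the prediction given the null prompt

eps_w1 = cfg_combine(eps_cond, eps_uncond, w=1.0)   # w = 1 recovers the conditional prediction
eps_w7 = cfg_combine(eps_cond, eps_uncond, w=7.5)   # illustrative guidance weight
```

Equivalently, the result is the unconditional prediction plus $w$ times the difference between the conditional and unconditional predictions. Both branches therefore need to stay accurate after quantization.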

3.2 Per-Layer Quantization Error Analysis

Obtaining Quantized Models. We first perform a per-layer sensitivity analysis for the diffusion model. Specifically, given a pre-trained full-precision diffusion model, we quantize each layer to 1, 2, and 3 bits in turn while keeping all other layers at full precision, and perform quantization-aware training (QAT) for each configuration. For instance, for the SD-v1.5 UNet with 256 layers (excluding time embedding, the first and last layers), we get a total of 768 quantized candidates (256 layers × 3 bit widths). We perform QAT over each candidate on a pre-defined training sub-dataset, and validate the incurred quantization error of each candidate by comparing it against the full-precision model (more details in App. B).

(a) Columns, from left to right: SD-v1.5, CA toq, CA tok, CA tov, CA tok, CA tok, RB conv, RB conv shortcut, CA tok, RB conv shortcut. The leftmost column shows the images synthesized by SD-v1.5 FP32 and the other columns show images generated by the quantized models, where only one layer is quantized (e.g., CA toq denotes that the cross-attention layer for Query projection is quantized and RB conv shortcut denotes that the Convolution Shortcut layer in the Residual Block is quantized). The quantized layers follow the same order as the highlighted layers in (b) and (c), from left to right. Quantizing the layers impacts both the image quality (as in RB conv shortcut) and text-image alignment (e.g., the teddy bear disappears after quantizing some CA tok layers).
(b) MSE value by quantizing layers in SD-v1.5.
(c) CLIP score drop by quantizing layers in SD-v1.5.
Figure 2: 1-bit quantization error analysis for all the layers from the UNet of SD-v1.5.

Measuring Quantization Errors. To find the appropriate way to interpret the quantization error, we analyze four metrics: Mean-Squared-Error (MSE) that quantifies the pixel-level discrepancies between images (generations from the floating-point and the quantized model in our case), LPIPS [80] that assesses human-like perceptual similarity judgments, PSNR [23] that measures image quality by comparing the maximum possible power of a signal with the power of the corrupting noise, and CLIP score [58] that evaluates the correlation between an image and its language description. After collecting the scores (examples in Fig. 2(b) and Fig. 2(c); full metrics are listed in App. F), we further measure their consistency by calculating the Pearson correlation [8] for different metrics under the same bit widths (in Tab. 1), and different bit widths under the same metric (in Tab. 2). With these empirical results, we draw the following two main observations.

Observation 1: MSE, PSNR, and LPIPS show strong correlation and they correlate well with the visual perception of image quality.

Tab. 1 shows that MSE is highly correlated with PSNR and LPIPS under the same bit width. Additionally, we observe a similar trend of per-layer quantization error under different bit widths, as in Tab. 2. As for the visual quality in Fig. 2(a) and 2(b), we can see that higher MSE errors lead to severe image quality degradation, e.g., the highlighted RB conv shortcut. Therefore, the MSE metric effectively reflects quality degradations incurred by quantization, and it is unnecessary to incorporate PSNR and LPIPS further.

Observation 2: After low-bit quantization, changes in CLIP score are not consistently correlated with MSE across different layers. Although some layers show smaller MSE, they may experience larger semantic degradation, reflected in larger CLIP score changes.

We notice that, after quantization, the CLIP score changes for all layers only have a weak correlation with MSE, as illustrated in Tab. 1. Some layers display smaller MSE but larger changes in CLIP score. For example, in Fig. 2(b), the MSE of the CA tok layer (5th highlighted layer (green) from left to right) is less than that of the RB conv layer (6th highlighted layer (orange) from left to right), yet the changes in CLIP score are the opposite. As observed in the first row of Fig. 2(a), compared to the RB conv layer, quantizing this CA tok layer changes the image content from "a teddy bear" to "a person", which diverges from the text prompt "A teddy bear on a skateboard in Times Square, doing tricks on a cardboard box ramp". This occurs because MSE measures only the difference between two images, which does not capture the semantic degradation. In contrast, the CLIP score reflects the quantization error in terms of semantic information between the text and image. Thus, we employ the CLIP score as a complementary metric to represent the quantization error.

Table 1: Pearson correlation (absolute value) of quantization error between different metrics (e.g., MSE vs. PSNR denotes the correlation between two metrics) when quantizing individual layers to 1, 2, and 3 bits. CS denotes CLIP Score.
Bit width   MSE vs. PSNR   MSE vs. LPIPS   MSE vs. CS
1 bit       0.870          0.984           0.733
2 bit       0.882          0.989           0.473
3 bit       0.869          0.991           0.535
Table 2: Pearson correlation (absolute value) of quantization error between different bit pairs (e.g., 1 vs. 2 denotes the correlation between the two bit widths) for a single metric when quantizing individual layers to 1, 2, and 3 bits.
Bit pair      MSE     PSNR    LPIPS   CLIP Score
1 vs. 2 bit   0.929   0.954   0.943   0.504
1 vs. 3 bit   0.766   0.843   0.802   0.344
2 vs. 3 bit   0.887   0.923   0.895   0.428
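The correlation analysis behind the two tables can be sketched as follows. This example computes MSE and PSNR for a handful of synthetic "single-layer-quantized" generations and then the absolute Pearson correlation between the two score vectors. The images are random arrays with increasing noise, so the numbers it produces are unrelated to those reported above. LPIPS and CLIP score are omitted because they require pretrained networks.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=1.0):
    """PSNR in dB for images scaled to [0, max_val]."""
    return float(10.0 * np.log10(max_val ** 2 / mse(a, b)))

def pearson_abs(x, y):
    """Absolute Pearson correlation between two per-layer score vectors."""
    return float(abs(np.corrcoef(x, y)[0, 1]))

rng = np.random.default_rng(0)
reference = rng.uniform(size=(3, 32, 32))      # stand-in for a full-precision generation
noise_levels = [0.01, 0.02, 0.05, 0.1, 0.2]    # one synthetic "quantized layer" per level
candidates = [np.clip(reference + rng.normal(scale=n, size=reference.shape), 0, 1)
              for n in noise_levels]

mse_scores = np.array([mse(reference, c) for c in candidates])
psnr_scores = np.array([psnr(reference, c) for c in candidates])
corr = pearson_abs(mse_scores, psnr_scores)
```

The same `pearson_abs` helper applies to the bit-pair comparison: pass the per-layer scores of one metric at two different bit widths.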

3.3 Deciding the Optimal Precision

With the above observations, we then develop the strategy for bit-width assignments. We select MSE and CLIP as our quantitative metrics, along with the number of parameters of each layer as the indicator of size savings.

Assigning bits based on MSE. Intuitively, layers with more parameters and lower quantization error are better candidates for extremely low-bit quantization, as the overall bit widths of the model can be significantly reduced. According to this, we propose a layer size-aware sensitivity score $\mathcal{S}$. For the $i$-th layer, its sensitivity score for $b$ bits ($b \in \{1, 2, 3\}$) is defined as $\mathcal{S}_{i,b} = M_{i,b} N_i^{-\eta}$, where $M$ denotes the MSE error, $N$ is the total number of parameters of the layer, and $\eta \in [0, 1]$ denotes the parameter size factor. To determine the bit width (i.e., $b^{*}$) for each layer, we define a sensitivity threshold $\mathcal{S}_o$, and the $i$-th layer is assigned $b_i^{*}$ bits, where $b_i^{*} = \min\{b \mid \mathcal{S}_{i,b} < \mathcal{S}_o\}$. The remaining layers are assigned 4 bits.
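The MSE-based assignment rule can be sketched in a few lines of NumPy. The per-layer MSE values, parameter counts, $\eta$, and threshold below are invented for illustration. The paper's full procedure (Alg. 1 in App. B) also accounts for CLIP score, which this sketch does not.

```python
import numpy as np

def assign_bits(mse_per_bit, num_params, eta, threshold, fallback_bits=4):
    """Pick the smallest b in {1, 2, 3} with S_{i,b} = M_{i,b} * N_i^(-eta) < threshold."""
    candidate_bits = np.array([1, 2, 3])
    # scores[i, k] is the sensitivity of layer i at candidate_bits[k].
    scores = mse_per_bit * num_params.astype(float)[:, None] ** (-eta)
    below = scores < threshold
    first = below.argmax(axis=1)   # index of the first bit width under the threshold
    # Layers with no bit width under the threshold fall back to 4 bits.
    return np.where(below.any(axis=1), candidate_bits[first], fallback_bits)

# Hypothetical per-layer MSE at 1, 2, and 3 bits, and parameter counts.
mse_per_bit = np.array([[0.08, 0.02, 0.005],
                        [0.60, 0.30, 0.050],
                        [0.90, 0.70, 0.500],
                        [0.01, 0.004, 0.001]])
num_params = np.array([1_000_000, 1_000_000, 10_000, 250_000])

bits = assign_bits(mse_per_bit, num_params, eta=0.5, threshold=1e-4)
# Parameter-weighted average bit width of the toy model.
avg_bits = float((bits * num_params).sum() / num_params.sum())
```

Because of the $N_i^{-\eta}$ factor, a large layer can tolerate a higher MSE than a small one before it is pushed to a wider bit width. This is what drives the average bit width down.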

Assigning bits based on CLIP score. For the layers whose CLIP score drops substantially after quantization, instead of assigning bits based on the sensitivity score as discussed above, we directly assign higher bits to those layers. As a result, the quantized model can produce content that aligns with the semantic information of the prompt. We provide the detailed mixed-precision algorithm in Alg. 1 of App. B.

4 Training Extreme Low-bit Diffusion Model

With the bits of each layer decided, we then train the quantized model with a series of techniques to improve performance. The overview of our approach is illustrated in Fig. 3.

4.1 Initializing the Low-bit Diffusion Model

Time Embedding Pre-computing and Caching. During the inference time of a diffusion model, a time step $t$ is transformed into an embedding through projection layers to be incorporated into the diffusion model. As mentioned by existing works [27], the quantization of the projection layers can lead to large quantization errors. However, the embedding from each time step $t$ is always the same, suggesting that we can actually pre-compute the embedding offline and load cached values during inference, instead of computing the embedding every time. Furthermore, the storage size of the time embedding is 25.6× smaller than the projection layers. Therefore, we pre-compute the time embedding and save the model without the projection layers. More details are provided in App. C.
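The caching idea can be sketched as below: the projection is evaluated once per inference time step offline, and at inference a table lookup replaces the layers. The two-layer projection with random weights, the toy dimensions, and the four-step schedule are hypothetical stand-ins, not the actual SD-v1.5 modules or sampler schedule.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Standard sinusoidal encoding of an integer time step."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, hidden = 32, 128                      # toy sizes, not those of SD-v1.5
W1, b1 = rng.normal(size=(dim, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, hidden)) * 0.1, np.zeros(hidden)

def time_projection(t):
    """Hypothetical stand-in for the projection layers applied to the time step."""
    h = silu(sinusoidal_embedding(t, dim) @ W1 + b1)
    return h @ W2 + b2

# Offline: evaluate the projection once for every time step used at inference.
inference_steps = [981, 961, 941, 921]     # illustrative schedule
cache = {t: time_projection(t) for t in inference_steps}

# Online: a lookup replaces the projection layers, which no longer need to be stored.
emb = cache[961]
```

The cache grows with the number of distinct inference time steps rather than with the size of the projection weights, and the cached values are exact, so these layers contribute no quantization error.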

Adding Balance Integer. In general, weight distributions in deep neural networks are observed as symmetric around zero [84]. To validate the assumption on SD-v1.5, we analyze its weight distribution for the layers under full precision by calculating the skewness of weights. Notably, the skewness of more than 97% of the layers ranges between $[-0.5, 0.5]$, indicating that the weight distributions are symmetric in almost all layers. Further details are provided in App. D.
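A per-layer skewness check of this kind might look like the sketch below, which uses the third standardized moment. The exact estimator used in the paper is not stated here, so this definition is an assumption, and the four synthetic "layers" are random samples rather than SD-v1.5 weights.

```python
import numpy as np

def skewness(w):
    """Sample skewness: third standardized moment of the flattened weights."""
    w = np.ravel(w)
    centered = w - w.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(0)
# Synthetic "layers": three roughly symmetric ones and one clearly skewed one.
layers = {
    "gaussian_a": rng.normal(size=10_000),
    "gaussian_b": rng.normal(scale=0.02, size=10_000),
    "uniform": rng.uniform(-1.0, 1.0, size=10_000),
    "exponential": rng.exponential(size=10_000),
}
skew = {name: skewness(w) for name, w in layers.items()}
# Fraction of layers whose skewness lies in [-0.5, 0.5].
symmetric_fraction = float(np.mean([abs(v) <= 0.5 for v in skew.values()]))
```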

However, existing works on diffusion model quantization overlook the symmetric property [37, 65, 38], as they perform relatively higher-bit quantization, e.g., 4 or 8 bits. This hurts the model performance at extremely low bit levels. For example, in 1-bit quantization, the most symmetric possible integer outcomes can only be $\{0, 1\}$ or $\{-1, 0\}$. Similarly, for 2-bit quantization, the most balanced mapping integers can be either $\{-2, -1, 0, 1\}$ or $\{-1, 0, 1, 2\}$, significantly disrupting the symmetric property. The absence of a single value among 2 or 4 numbers under low-bit quantization can have a significant impact. To tackle this, we leverage the bit balance strategy [36, 51] to initialize the model. Specifically, we introduce an additional value to balance the original quantization values. Namely, in a 1-bit model, we adjust the candidate integer set from $\{0, 1\}$ to $\{-1, 0, 1\}$, achieving a more balanced distribution. By doing so, we treat the balanced $n$-bit weights as $\texttt{log}(2^{n}+1)$-bit.
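The effect of the balance integer on the level set and on the storage cost can be illustrated as follows. The helper names are hypothetical, and the logarithm is read as base 2 (an assumption, since bit counts are being measured).

```python
import numpy as np

def integer_levels(bits, balanced):
    """Signed integer levels available to a b-bit weight."""
    half = 2 ** (bits - 1)
    # Plain b-bit grid: {-half, ..., half - 1}; the balanced grid adds +half as an extra level.
    upper = half if balanced else half - 1
    return np.arange(-half, upper + 1)

def effective_bits(bits):
    """Cost of 2^b + 1 levels, read as log2(2^b + 1) bits per weight."""
    return float(np.log2(2 ** bits + 1))

levels_1bit = integer_levels(1, balanced=False)       # asymmetric 1-bit grid
levels_1bit_bal = integer_levels(1, balanced=True)    # balanced grid with one extra level
levels_2bit_bal = integer_levels(2, balanced=True)
```

Under this reading, a balanced 1-bit layer costs about 1.585 bits per weight and a balanced 2-bit layer about 2.32, so the extra level buys symmetry for a fractional increase in storage.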

Scaling Factor Initialization via Alternating Optimization. Initializing scaling factors is an important step in quantization. Existing QAT works typically employ the Min-Max initialization strategy [17, 48] to ensure the outliers are adequately represented and preserved. However, such a method faces challenges in extremely low-bit quantization settings like 1-bit, since the distribution of the full-precision weights is overlooked, leading to a large quantization error and increased difficulty in converging. Therefore, we aim to minimize the $\ell_2$ error between the quantized weights and full-precision weights with the optimization objective as:

$$\min_{\mathbf{s}} \left\| \mathbf{s} \cdot (\bm{\theta}_{\texttt{int}} - I_z) - \bm{\theta}_{\texttt{fp}} \right\|^{2}. \tag{4}$$

Nevertheless, considering the rounding operation, calculating an exact closed-form solution is not straightforward [29]. Inspired by the Lloyd-Max algorithm [28, 49], we use an optimization method on the scaling factor $\mathbf{s}$ to minimize the initialization error of our quantized diffusion model as follows:

$$\bm{\theta}_{\texttt{int}}^{j} = Q_{\texttt{int}}(\bm{\theta}_{\texttt{fp}}, \mathbf{s}^{j-1}); \quad \mathbf{s}^{j} = \frac{\bm{\theta}_{\texttt{fp}}^{j} (\bm{\theta}_{\texttt{int}}^{j} - I_z)^{\intercal}}{(\bm{\theta}_{\texttt{int}}^{j} - I_z)(\bm{\theta}_{\texttt{int}}^{j} - I_z)^{\intercal}}, \tag{5}$$

where $Q_{\texttt{int}}(\cdot)$ denotes the integer mapping quantization operation that converts the full-precision weights to integers as in Eq. (1), and $j$ represents the iterative step. The optimization is done for 10 steps.
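The alternating scheme of Eq. (5) can be sketched for a single output channel as follows: with the scale fixed, weights are re-quantized via Eq. (1); with the integers fixed, the scale that minimizes Eq. (4) has a closed least-squares form. The Min-Max starting point, the choice $I_z = 2^{b-1}$, and the Gaussian toy weights are assumptions of this sketch.

```python
import numpy as np

def quantize_int(theta_fp, s, zero_point, bits):
    """Integer mapping of Eq. (1)."""
    return np.clip(np.rint(theta_fp / s) + zero_point, 0, 2 ** bits - 1)

def l2_error(theta_fp, s, zero_point, bits):
    q = quantize_int(theta_fp, s, zero_point, bits) - zero_point
    return float(np.sum((s * q - theta_fp) ** 2))

def init_scale(theta_fp, bits, zero_point, steps=10):
    """Alternate between re-quantizing with s and the least-squares update of s."""
    s = np.abs(theta_fp).max() / 2 ** (bits - 1)   # assumed Min-Max style start
    errors = []
    for _ in range(steps):
        q = quantize_int(theta_fp, s, zero_point, bits) - zero_point
        denom = float(q @ q)
        if denom == 0.0:                           # every weight mapped to the zero point
            break
        s = float(theta_fp @ q) / denom            # minimizer of Eq. (4) for fixed integers
        errors.append(float(np.sum((s * q - theta_fp) ** 2)))
    return s, errors

rng = np.random.default_rng(0)
theta_fp = rng.normal(scale=0.05, size=4096)       # one synthetic output channel
bits = 2
zero_point = 2 ** (bits - 1)

s_minmax = np.abs(theta_fp).max() / 2 ** (bits - 1)
err_minmax = l2_error(theta_fp, s_minmax, zero_point, bits)
s_opt, errors = init_scale(theta_fp, bits, zero_point)
```

Each half-step can only lower or preserve the squared error, so the recorded `errors` are non-increasing and never exceed the Min-Max error. The procedure finds a local optimum and is not guaranteed to reach the global one.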

Figure 3: Overview of the training and inference pipeline for the proposed BitsFusion. Left: We analyze the quantization error for each layer in SD-v1.5 (Sec. 3.2) and derive the mixed-precision recipe (Sec. 3.3) to assign different bit widths to different layers. We then initialize the quantized UNet by adding a balance integer, pre-computing and caching the time embedding, and alternately optimizing the scaling factor (Sec. 4.1). Middle: During the Stage-I training, we freeze the teacher model (i.e., SD-v1.5) and optimize the quantized UNet through CFG-aware quantization distillation and feature distillation losses, along with sampling time steps by considering quantization errors (Sec. 4.2). During the Stage-II training, we fine-tune the previous model with the noise prediction. Right: For the inference stage, using the pre-cached time features, our model processes text prompts and generates high-quality images.

4.2 Two-Stage Training Pipeline

With the mixed-precision model initialized, we introduce the two-stage training pipeline. In Stage-I, we train the quantized model using the full-precision model as the teacher through distillation loss. In Stage-II, we fine-tune the model from the previous stage using noise prediction [21, 72].

CFG-aware Quantization Distillation. Similar to existing works [11], we fine-tune the quantized diffusion model to improve the performance. Here both the weights and scaling factors are optimized. Additionally, we notice that training the quantized model in a distillation fashion using the full-precision model yields better performance than training directly with vanilla noise prediction. Furthermore, during distillation, it is crucial for the quantized model to be aware of CFG, i.e., text dropping is applied during distillation. Specifically, our training objective is as follows:

$$\mathcal{L}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}^{\texttt{noise}}=\mathbb{E}_{t,\mathbf{x}}\left[\left\|\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}(t,\mathbf{z}_{t},\mathbf{c})\right\|\right],\quad\mathbf{c}=\varnothing\ \textrm{if}\ P\sim U[0,1]<p\ \textrm{else}\ \mathbf{c},\tag{6}$$

where $P$ is drawn uniformly from $[0,1]$ to decide whether the text condition is dropped during training, and the dropping probability $p$ is set to $0.1$.
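As a rough illustration of Eq. (6), the sketch below applies text dropping to a single sample and measures the L2 distance between teacher and student predictions on the same (possibly dropped) condition. The names `drop_text`, `noise_distill_loss`, and `NULL_PROMPT` are hypothetical, the teacher and student are stand-in callables rather than real UNets, and using the empty string as the null condition is an assumption.

```python
import math
import random

NULL_PROMPT = ""  # assumption: the empty string stands in for the null condition


def drop_text(prompt, p=0.1, rng=random):
    """Return the null prompt with probability p, otherwise the prompt itself."""
    return NULL_PROMPT if rng.random() < p else prompt


def noise_distill_loss(teacher, student, t, z_t, prompt, p=0.1, rng=random):
    """Single-sample version of Eq. (6).

    teacher and student are callables (t, z_t, c) -> list of floats.
    Both models receive the same condition c after text dropping.
    """
    c = drop_text(prompt, p, rng)
    eps_fp = teacher(t, z_t, c)
    eps_int = student(t, z_t, c)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(eps_fp, eps_int)))
```

Feeding the same dropped condition to both models is what lets the student match the teacher on the unconditional branch that CFG relies on at inference time.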

Feature Distillation. To further improve the generation quality of the quantized model, we distill the full-precision model at a more fine-grained level through feature distillation [32] as follows:

$$\mathcal{L}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}^{\texttt{feat}}=\mathbb{E}_{t,\mathbf{x}}\left[\left\|\mathcal{F}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\mathcal{F}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}(t,\mathbf{z}_{t},\mathbf{c})\right\|\right],\tag{7}$$

where $\mathcal{F}_{\boldsymbol{\theta}}(\cdot)$ denotes the operation for getting features from the Down and Up blocks in UNet. We then have the overall distillation loss $\mathcal{L}^{\texttt{dist}}$ in Stage-I as follows:

$$\mathcal{L}^{\texttt{dist}}=\mathcal{L}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}^{\texttt{noise}}+\lambda\mathcal{L}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}^{\texttt{feat}},\tag{8}$$

where $\lambda$ is empirically set as $0.01$ to balance the magnitude of the two loss functions.
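A minimal sketch of how the two terms of Eq. (8) combine for one sample is given below. The function names are hypothetical, tensors are represented as flat Python lists, and summing the per-block feature distances is an assumption; the text does not say how distances from the individual Down and Up blocks are aggregated.

```python
import math


def l2(a, b):
    """Euclidean distance between two equally sized flat vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distillation_loss(eps_fp, eps_int, feats_fp, feats_int, lam=0.01):
    """Single-sample version of Eq. (8): noise term plus lam times feature term.

    feats_fp and feats_int are lists holding one flat feature vector per block.
    Assumption: per-block distances are summed.
    """
    noise_term = l2(eps_fp, eps_int)
    feat_term = sum(l2(f, g) for f, g in zip(feats_fp, feats_int))
    return noise_term + lam * feat_term
```

With `lam=0.01`, a feature term that is two orders of magnitude larger than the noise term contributes on the same scale, which matches the stated purpose of balancing the magnitudes of the two losses.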

Figure 4: More time steps are sampled towards where larger quantization error occurs.

Quantization Error-aware Time Step Sampling. The training of diffusion models requires sampling different time steps in each optimization iteration. We explore how to adjust the strategy for time step sampling such that the quantization error in each time step can be effectively reduced during training. We first train a $1.99$-bit quantized model with Eq. (8). Then, we calculate the difference of the predicted latent features between the quantized model and the full-precision model as $\mathbb{E}_{t,\mathbf{x}}\left[\frac{1-\bar{\alpha}_{t}}{\bar{\alpha}_{t}}\|\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}(t,\mathbf{z}_{t},\mathbf{c})\|^{2}\right]$, where $t\in[0,1,\cdots,999]$ and $\bar{\alpha}_{t}$ is the noise scheduler (detailed derivation in App. E). The evaluation is conducted on a dataset with $128$ image-text pairs. Fig. 4 shows that the quantization error is not distributed equally across time steps. Notably, the quantization error keeps increasing as the time steps approach $t=999$.
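The weight $(1-\bar{\alpha}_t)/\bar{\alpha}_t$ in the error measure can be explored numerically. Under the standard relation that recovers a clean latent from a noise prediction, $\hat{\mathbf{z}}_0 = (\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\hat{\epsilon}})/\sqrt{\bar{\alpha}_t}$, a gap in the noise prediction is scaled by $\sqrt{(1-\bar{\alpha}_t)/\bar{\alpha}_t}$ in latent space, which is presumably where the weight comes from (App. E has the authors' derivation). The sketch below assumes a scaled-linear beta schedule with `beta_start=0.00085` and `beta_end=0.012` over 1000 steps, a configuration commonly used with SD-v1.5 but not stated in this section. All function names are hypothetical.

```python
def alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule.

    Assumption: betas are the squares of values spaced linearly between
    sqrt(beta_start) and sqrt(beta_end).
    """
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        prod *= 1.0 - beta
        out.append(prod)
    return out


def error_weight(alpha_bar_t):
    """Weight applied to the squared noise-prediction gap at one time step."""
    return (1.0 - alpha_bar_t) / alpha_bar_t


def weighted_error(alpha_bar_t, eps_fp, eps_int):
    """Single-sample latent-space error between teacher and student."""
    sq_gap = sum((a - b) ** 2 for a, b in zip(eps_fp, eps_int))
    return error_weight(alpha_bar_t) * sq_gap
```

Because $\bar{\alpha}_t$ shrinks as $t$ grows, the weight grows with $t$, so an equal gap in noise prediction costs more in latent space at late time steps. This is consistent with the error trend reported for Fig. 4, although the measured curve also depends on the size of the gap itself.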

To mitigate the quantization error prevalent near time step $t=999$, we propose a sampling strategy that uses a distribution tailored to sample more of the time steps exhibiting the largest quantization errors, thereby enhancing performance. To achieve this goal, we leverage the Beta distribution. Specifically, time steps are sampled according to $t\sim\textit{Beta}(\alpha,\beta)$, as shown in Fig. 4. We empirically set $\alpha=3.0$ and $\beta=1.0$ for the best performance. Combining this time step sampling strategy with Eq. (8), we conduct the Stage-I training.
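The sampling strategy can be sketched with Python's standard library. The mapping from a Beta draw in $[0,1]$ to an integer time step in $\{0,\dots,999\}$ is an assumption (scale and floor, clipped to the last step); the text does not specify it, and `sample_timestep` is a hypothetical name.

```python
import random


def sample_timestep(alpha=3.0, beta=1.0, num_steps=1000, rng=random):
    """Draw one training time step biased towards large t.

    u ~ Beta(alpha, beta) lies in [0, 1]; it is scaled to an integer index.
    Assumption: floor(u * num_steps), clipped to num_steps - 1.
    """
    u = rng.betavariate(alpha, beta)
    return min(int(u * num_steps), num_steps - 1)
```

For example, `[sample_timestep() for _ in range(4)]` yields four time steps whose distribution has mean close to $\alpha/(\alpha+\beta)\cdot 1000 = 750$ for the default parameters, compared with about $500$ under uniform sampling.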

Fine-tuning with Noise Prediction. After getting the model trained with the distillation loss in Stage-I, we then fine-tune it with noise prediction, as in Eq. (2), in Stage-II. We apply text dropping with a probability of $10\%$ and modify the distribution of time step sampling based on the quantization error, as introduced above. We use two separate fine-tuning stages, instead of combining Stage-I and Stage-II, because we observe more stable training results.

5 Experiments

Implementation Details. We develop our code using the diffusers library (https://github.com/huggingface/diffusers) and train the models with the AdamW optimizer [33] and a constant learning rate of $1\mathrm{e}{-05}$ on an internal dataset. For Stage-I, we use $8$ NVIDIA A100 GPUs with a total batch size of $256$ to train the quantized model for $20$K iterations. For Stage-II, we use $32$ NVIDIA A100 GPUs with a total batch size of $1024$ to train the quantized model for $50$K iterations. During inference, we adopt the PNDM scheduler [45] with $50$ sampling steps to generate images for comparison. Other sampling approaches (e.g., DDIM [70] and DPMSolver [50]) lead to the same conclusion (App. K).

Evaluation Metrics. We evaluate with CLIP score and FID on MS-COCO [44], TIFA [26], GenEval [14], and human evaluation on PartiPrompts [77]. We adopt the ViT-B/32 model [10] for the CLIP score and Mask2Former (Swin-S-8$\times$2) [5] for GenEval. App. I provides details for the metrics.

5.1 Main Results

Comparison with SD-v1.5. Our quantized $1.99$-bits UNet consistently outperforms the full-precision model across all metrics.

  • $30$K MS-COCO 2014 Validation Set. For the CLIP score, as demonstrated in Fig. 5(a), attributed to the proposed mixed-precision recipe with the introduced initialization techniques and advanced training schemes in Stage-I, our $1.99$-bits UNet, with a storage size of $219$MB, achieves performance comparable to the original SD-v1.5. Following Stage-II training, our model surpasses the performance of the original SD-v1.5. With CFG scales ranging from $2.5$ to $9.5$, our model yields $0.002\sim 0.003$ higher CLIP scores. We provide a discussion of FID in App. L.

  • TIFA. As shown in Fig. 5(b), our $1.99$-bits model with Stage-I training performs comparably to SD-v1.5. With the Stage-II training, our model achieves better metrics than SD-v1.5.

  • GenEval. We show the comparison results for GenEval in Fig. 5(c) (detailed comparisons of GenEval score are presented in App. M). Our model outperforms SD-v1.5 for all CFG scales.

  • Human Evaluation. More users prefer the images generated by our quantized model over those from SD-v1.5, with a preference ratio of $54.4\%$. We provide a detailed comparison in App. J.

Comparison with Other Quantization Approaches. Additionally, we compare our approach with other works, including LSQ [11], Q-Diffusion [37], EfficientDM [17], and Apple-MBP [54], as shown in Tab. 3. Our model achieves a higher CLIP score than all other works and better performance than SD-v1.5.

(a) CLIP score on MS-COCO. (b) Evaluation on TIFA. (c) Evaluation on GenEval.
Figure 5: Comparison between our $1.99$-bits model vs. SD-v1.5 on various evaluation metrics with CFG scales ranging from $2.5$ to $9.5$. Ours-I denotes the model with Stage-I training and Ours-II denotes the model with Stage-II training.
Table 3: Comparison with existing quantization methods, including LSQ [11], Q-Diffusion [37], EfficientDM [17], and Apple-MBP [54]. The CLIP score is measured on 1K PartiPrompts.

| Method | Bit-width | CLIP score |
|---|---|---|
| SD-v1.5 | 32 | 0.3175 |
| LSQ | 2 | 0.2849 |
| Q-Diffusion | 4 | 0.3137 |
| EfficientDM | 2 | 0.2918 |
| Apple-MBP | 2 | 0.3023 |
| Ours | 1.99 | 0.3212 |

Table 4: Analysis of our proposed methods measured under various CFG scales, i.e., $3.5$, $5.5$, $7.5$, and $9.5$. We use LSQ [11] as the basic QAT method, which involves the training of weights and scaling factors of a uniformly $2$-bit quantized UNet. Then, we gradually introduce each proposed technique to evaluate their effectiveness. CLIP scores are measured on 1K PartiPrompts.

| Method | Bit-width | CFG 3.5 | CFG 5.5 | CFG 7.5 | CFG 9.5 | Average | Δ |
|---|---|---|---|---|---|---|---|
| SD-v1.5 | 32 | 0.3110 | 0.3159 | 0.3175 | 0.3180 | 0.3156 | - |
| QAT-Base | 2 | 0.2679 | 0.2793 | 0.2849 | 0.2868 | 0.2797 | - |
| +Balance | 2.32 | 0.2990 | 0.3059 | 0.3080 | 0.3086 | 0.3054 | +0.0257 |
| +Alternating Opt. | 2.32 | 0.3061 | 0.3108 | 0.3117 | 0.3115 | 0.3100 | +0.0046 |
| +Mixed/Caching | 1.99 | 0.3055 | 0.3129 | 0.3142 | 0.3145 | 0.3118 | +0.0018 |
| +Feat Dist. | 1.99 | 0.3086 | 0.3147 | 0.3167 | 0.3169 | 0.3142 | +0.0024 |
| +Time Sampling | 1.99 | 0.3098 | 0.3159 | 0.3181 | 0.3184 | 0.3156 | +0.0014 |
| +Fine-tuning | 1.99 | 0.3163 | 0.3192 | 0.3212 | 0.3205 | 0.3183 | +0.0027 |

5.2 Ablation Analysis

Here we perform extensive analysis for our proposed method. We mainly evaluate different experimental settings using the CLIP score measured on 1K PartiPrompts [77].

Analysis of the Proposed Techniques. We adopt the LSQ [11] as the basic QAT method to update the weights and scaling factors of a uniform 2-bit UNet with Min-Max initialization. Results are presented in Tab. 4 with the following details:

  • +Balance. By adding a balance integer, a $2$-bit model that typically represents $4$ integer values can now represent $5$ integers, becoming a $2.32$-bit model, since $\log_2(4+1)\approx 2.32$ bits. The average CLIP score increases significantly from $0.2797$ to $0.3054$.

  • +Alternating Opt. By further utilizing the scaling factor initialization via alternating optimization, the average CLIP score of the $2.32$-bit model increases to $0.3100$.

  • +Mixed/Caching. By leveraging time embedding pre-computing and caching, we minimize the storage requirements for time embedding and projection layers by only retaining the calculated features. This significantly reduces the averaged bits. Combined with our mixed-precision strategy, this approach reduces the average bits from $2.32$ to $1.99$ bits and can even improve the performance, i.e., the CLIP score improves from $0.3100$ to $0.3118$.

  • +Feat Dist. By incorporating the feature distillation loss, i.e., Eq. (7), the model can learn more fine-grained information from the teacher model, improving the CLIP score from $0.3118$ to $0.3142$.

  • +Time Sampling. By employing a quantization error-aware sampling strategy over time steps, the model focuses more on the time steps near $t=999$. With this sampling strategy, our $1.99$-bits model performs very close to, or even outperforms, the original SD-v1.5.

  • +Fine-tuning. By continuing with Stage-II training that incorporates noise prediction, our $1.99$-bits model consistently outperforms SD-v1.5 across various guidance scales, improving the CLIP score to $0.3183$.
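The bit-width arithmetic behind the +Balance and +Mixed/Caching steps can be sketched as follows. Treating the reported average bit-width as a parameter-count-weighted mean over layers is an assumption, and both function names are hypothetical.

```python
import math


def effective_bits(num_levels):
    """Bits needed to index num_levels distinct quantization values."""
    return math.log2(num_levels)


def average_bits(layers):
    """Parameter-weighted average bit-width.

    layers is a list of (num_params, bits) pairs, one per layer.
    Assumption: each layer contributes in proportion to its parameter count.
    """
    total_params = sum(n for n, _ in layers)
    return sum(n * b for n, b in layers) / total_params
```

Here `effective_bits(5)` gives roughly 2.32, the figure quoted for a 2-bit layer with one extra balance integer. Under the weighted-mean assumption, assigning fewer levels to large layers lowers `average_bits` more than doing so for small layers, which is one way a mixed-precision recipe can land below 2 bits on average while some layers keep more levels.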

Table 5: Analysis of $\eta$ in the mixed-precision strategy.

| $\eta$ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|
| CLIP score | 0.3155 | 0.3173 | 0.3162 | 0.3181 | 0.3171 | 0.3168 |

Table 6: Analysis of $\lambda$ in distillation loss.

| $\lambda$ | 1 | 0.1 | 0.01 |
|---|---|---|---|
| CLIP score | 0.3164 | 0.3159 | 0.3181 |

Table 7: Analysis of $\alpha$ in time step-aware sampling.

| $\alpha$ | 1.5 | 2.0 | 3.0 |
|---|---|---|---|
| CLIP score | 0.3169 | 0.3173 | 0.3181 |

Effect of $\eta$ in Mixed-Precision Strategy. Tab. 5 illustrates the impact of the parameter size factor $\eta$ (as discussed in Sec. 3.3) in determining the optimal mixed-precision strategy. We generate six different mixed-precision recipes with different $\eta$ and compare them after 20K training iterations. Initially, we explore the mixed-precision strategy determined with and without the parameter size factor. Setting $\eta=0$ results in $N^{-\eta}=1$, indicating that the mixed precision is determined without considering the impact of parameter size. The results show that neglecting the parameter size significantly degrades performance. Further, we empirically choose $\eta=0.3$ in our experiments after comparing different values of $\eta$.

Effect of $\lambda$ of Distillation Loss. Tab. 6 illustrates the impact of the balance factor $\lambda$ for loss functions in Eq. (8). We empirically choose $\lambda=0.01$ in our experiments after comparing the performance.

Effect of $\alpha$ in Time Step-aware Sampling Strategy. Tab. 7 illustrates the impact of $\alpha$ for different Beta sampling distributions. As analyzed in Sec. 4.2, the quantization error increases near $t=999$. To increase the sampling probability near this time step, the Beta distribution requires $\alpha>1$ with $\beta=1$. A larger $\alpha$ enhances the sampling probability near $t=999$. Compared to $\alpha=1.5$ and $\alpha=2.0$, $\alpha=3.0$ concentrates more on later time steps and achieves the best performance. We choose $\alpha=3.0$ in our experiments.
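The effect of $\alpha$ can be made concrete with the closed form of the Beta distribution when $\beta=1$: the density is $\alpha u^{\alpha-1}$ and the cumulative distribution is $u^{\alpha}$ on $[0,1]$. The sketch below computes the probability mass that falls in the last tenth of the normalized time axis; the function names and the choice of a 0.9 threshold are illustrative assumptions.

```python
def beta_alpha_one_pdf(u, alpha):
    """Density of Beta(alpha, 1) at u in [0, 1]."""
    return alpha * u ** (alpha - 1)


def tail_mass(alpha, threshold=0.9):
    """Probability that a Beta(alpha, 1) draw exceeds threshold.

    Uses the closed-form CDF u ** alpha, valid only when beta = 1.
    """
    return 1.0 - threshold ** alpha


for alpha in (1.5, 2.0, 3.0):
    print(alpha, tail_mass(alpha))
```

Uniform sampling ($\alpha=1$) places 10% of draws in the last tenth of the time axis. The three settings compared in Tab. 7 raise this to about 15%, 19%, and 27% respectively, so $\alpha=3.0$ samples the high-error region nearly three times as often as uniform sampling.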

Analysis for Different Schedulers. One advantage of our training-based quantization approach is that our quantized model consistently outperforms SD-v1.5 for various sampling approaches. We conduct extensive evaluations on TIFA to show that we achieve better performance than SD-v1.5 when sampling with either DDIM [70] or DPMSolver [50]. More details are shown in App. K.

6 Conclusion

To enhance the storage efficiency of large-scale diffusion models, we introduce an advanced weight quantization framework, BitsFusion, which effectively compresses the weights of the UNet from SD-v1.5 to $1.99$ bits, achieving a $7.9\times$ smaller model size. BitsFusion even outperforms SD-v1.5 in terms of generation quality. Specifically, we first conduct a comprehensive analysis to understand the impact of each layer during quantization and establish a mixed-precision strategy. Second, we propose a series of effective techniques to initialize the quantized model. Third, during the training stage, we enforce the quantized model to learn from the full-precision SD-v1.5 by using distillation losses with the adjusted distribution of time step sampling. Finally, we fine-tune the previous quantized model through vanilla noise prediction. Our extensive evaluations on TIFA, GenEval, CLIP score, and human evaluation consistently demonstrate the advantage of BitsFusion over full-precision SD-v1.5.

References

  • [1] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
  • [2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [3] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  • [4] Hanwen Chang, Haihao Shen, Yiyang Cai, Xinyu Ye, Zhenzhong Xu, Wenhua Cheng, Kaokao Lv, Weiwei Zhang, Yintong Lu, and Heng Guo. Effective quantization for diffusion models on cpus. arXiv preprint arXiv:2311.16133, 2023.
  • [5] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
  • [6] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3043–3054, 2023.
  • [7] Huanpeng Chu, Wei Wu, Chengjie Zang, and Kun Yuan. Qncd: Quantization noise correction for diffusion models. arXiv preprint arXiv:2403.19140, 2024.
  • [8] Israel Cohen, Yiteng Huang, Jingdong Chen, Jacob Benesty, Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. Noise reduction in speech processing, pages 1–4, 2009.
  • [9] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • [11] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. In International Conference on Learning Representations, 2019.
  • [12] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
  • [13] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
  • [14] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
  • [15] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  • [16] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023.
  • [17] Yefei He, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Efficientdm: Efficient quantization-aware fine-tuning of low-bit diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
  • [18] Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Ptqd: Accurate post-training quantization for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • [19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [20] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • [21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • [22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • [23] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010.
  • [24] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  • [25] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. arXiv preprint arXiv:2303.11897, 2023.
  • [26] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023.
  • [27] Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, and Xianglong Liu. Tfmq-dm: Temporal feature maintenance quantization for diffusion models. arXiv preprint arXiv:2311.16503, 2023.
  • [28] Kyuyeon Hwang and Wonyong Sung. Fixed-point feedforward deep neural network design using weights+ 1, 0, and- 1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pages 1–6. IEEE, 2014.
  • [29] Yerlan Idelbayev, Pavlo Molchanov, Maying Shen, Hongxu Yin, Miguel A Carreira-Perpinán, and Jose M Alvarez. Optimal quantization using scaled codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12095–12104, 2021.
  • [30] Qing Jin, Jian Ren, Richard Zhuang, Sumant Hanumante, Zhengang Li, Zhiyu Chen, Yanzhi Wang, Kaiyuan Yang, and Sergey Tulyakov. F8net: Fixed-point 8-bit only multiplication for network quantization. arXiv preprint arXiv:2202.05239, 2022.
  • [31] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
  • [32] Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On architectural compression of text-to-image diffusion models. arXiv preprint arXiv:2305.15798, 2023.
  • [33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [34] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [35] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  • [36] Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
  • [37] Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17535–17545, 2023.
  • [38] Yanjing Li, Sheng Xu, Xianbin Cao, Xiao Sun, and Baochang Zhang. Q-dm: An efficient low-bit quantized diffusion model. Advances in Neural Information Processing Systems, 36, 2024.
  • [39] Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Textcraftor: Your text encoder can be image quality controller. arXiv preprint arXiv:2403.18978, 2024.
  • [40] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems, 36, 2024.
  • [41] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations, 2020.
  • [42] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
  • [43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [44] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [45] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
  • [46] Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, and Sergey Tulyakov. Hyperhuman: Hyper-realistic human generation with latent structural diffusion. arXiv preprint arXiv:2310.08579, 2023.
  • [47] Xuewen Liu, Zhikai Li, Junrui Xiao, and Qingyi Gu. Enhanced distribution alignment for post-training quantization of diffusion models. arXiv preprint arXiv:2401.04585, 2024.
  • [48] Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023.
  • [49] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
  • [50] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • [51] Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
  • [52] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. arXiv preprint arXiv:2402.14797, 2024.
  • [53] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [54] Atila Orhon, Michael Siracusa, and Aseem Wadhwa. Stable diffusion with core ml on apple silicon, 2022.
  • [55] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2023.
  • [56] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • [57] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
  • [58] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [59] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [60] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  • [61] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR, 2014.
  • [62] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [63] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
  • [64] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • [65] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981, 2023.
  • [66] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. In The Twelfth International Conference on Learning Representations, 2023.
  • [67] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  • [68] Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, and Eunhyeok Park. Temporal dynamic quantization for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • [69] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [70] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [71] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [72] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • [73] Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Zewen Wu, Yansong Tang, and Wenwu Zhu. Post-training quantization with progressive calibration and activation relaxing for text-to-image diffusion models. arXiv preprint arXiv:2311.06322, 2023.
  • [74] Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu. Towards accurate data-free quantization for diffusion models. arXiv preprint arXiv:2305.18723, 2023.
  • [75] Haoxuan Wang, Yuzhang Shang, Zhihang Yuan, Junyi Wu, and Yan Yan. Quest: Low-bit diffusion model quantization via efficient selective finetuning. arXiv preprint arXiv:2402.03666, 2024.
  • [76] Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, and Hongbo Zhang. Efficient quantization strategies for latent diffusion models. arXiv preprint arXiv:2312.05431, 2023.
  • [77] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.
  • [78] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978, 2022.
  • [79] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [80] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [81] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6027–6037, 2023.
  • [82] Yang Zhao, Yanwu Xu, Zhisheng Xiao, and Tingbo Hou. Mobilediffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567, 2023.
  • [83] Xingyu Zheng, Haotong Qin, Xudong Ma, Mingyuan Zhang, Haojie Hao, Jiakai Wang, Zixiang Zhao, Jinyang Guo, and Xianglong Liu. Binarydm: Towards accurate binarization of diffusion model. arXiv preprint arXiv:2404.05662, 2024.
  • [84] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In International Conference on Learning Representations, 2016.

Appendix


Appendix A Limitations

In this work, we study the storage size reduction of the UNet in Stable Diffusion v1.5 through weight quantization. The compression of the VAE and the CLIP text encoder [58] is also an interesting direction, which is not explored in this work. Additionally, our weight quantization techniques could be extended to activation quantization as a future exploration.

Appendix B More details for Mixed-Precision Algorithm

In Sec. 3, we analyze the per-layer quantization error and develop the mixed-precision strategy. Here, we provide the detailed algorithm as outlined in Alg. 1. The inputs include: a pre-defined candidate set of bit-widths $b\in\{1,2,3\}$, the full-precision SD-v1.5 $D$, the total number of layers $L$ (except for the time embedding, time projection, the first and last convolutional layers), the training dataset $X$, the number of training iterations $T$, the number of evaluation images for calculating metrics $K$, the bit threshold $\mathcal{S}_o$, the parameter size factor $\eta$, and the number of parameters of the $i$-th layer $N_i$.

In the first stage, we aim to obtain quantized models by quantizing each individual layer. Given the full-precision SD-v1.5 UNet $D$, we quantize each layer in turn to $1$, $2$, or $3$ bits, while keeping the remaining layers in FP32 format. Note that, to align with our experiments, we add the balance integer and initialize the scaling factor with our alternating optimization. For each quantized model, the weights and scaling factors are fine-tuned using quantization-aware training to minimize the quantization error by learning the predicted noise of SD-v1.5. We obtain quantized models $D_{i,b}$, $i=1,2,\cdots,L$, $b=1,2,3$.

In the second stage, we measure the quantization error of each layer by calculating various metrics that compare images generated by the quantized model $D_{i,b}$ with those from the unquantized SD-v1.5 $D$. Specifically, we generate $K=100$ baseline images $I_d$ from the full-precision SD-v1.5 model with PartiPrompts. Then, for each quantized model $D_{i,b}$, we use identical prompts and seeds to generate the corresponding images $I_{i,b}$. We calculate the quantization error by measuring metrics including MSE, CLIP score, PSNR, and LPIPS using these images and prompts.
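
The pixel-level metrics in this stage can be sketched as follows. This is a minimal illustration, assuming images are NumPy arrays with values in $[0,1]$; the function names (`mse`, `psnr`, `per_layer_error`) are hypothetical and not from the paper's code. CLIP score and LPIPS require pretrained networks and are omitted.

```python
import numpy as np


def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two images of identical shape."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))


def per_layer_error(baseline, candidate):
    """Average MSE and PSNR over K paired images.

    baseline[k] plays the role of an image in I_d and candidate[k] the
    matching image in I_{i,b} (same prompt and seed).
    """
    pairs = list(zip(baseline, candidate))
    return {
        "mse": float(np.mean([mse(x, y) for x, y in pairs])),
        "psnr": float(np.mean([psnr(x, y) for x, y in pairs])),
    }
```

Pairing images by prompt and seed matters here: with a fixed seed, any pixel difference can be attributed to the quantized layer rather than to sampling noise.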

In the third stage, we collect the mixed-precision recipe. We first compute a sensitivity score for each layer, factoring in both the MSE and the parameter size adjusted by $\eta$. For the $i$-th layer, its sensitivity score for $b$ bits ($b\in\{1,2,3\}$) is defined as $\mathcal{S}_{i,b}=M_{i,b}N_i^{-\eta}$, where $M_{i,b}$ denotes the MSE error, $N_i$ is the total number of parameters of the layer, and $\eta\in[0,1]$ denotes the parameter size factor. To determine the bit-width (i.e., $b^*$) for each layer, we define a sensitivity threshold $\mathcal{S}_o$, and the $i$-th layer is assigned $b_i^*$ bits, where $b_i^*=\min\{b \mid \mathcal{S}_{i,b}<\mathcal{S}_o\}$. Layers for which no candidate bit-width meets the threshold are set to 4 bits.
After determining the initial bits based on the MSE error, we refine this recipe by considering the degradation in the CLIP score associated with each bit-width. We simply consider the CLIP score change at 3 bits. We assign the layers with the highest $10\%$, $5\%$, and $2\%$ CLIP score drop 1, 2, and 3 more bits, respectively.

The final output is a mixed-precision recipe $\{b_i^*\}$, $i=1,2,\cdots,L$, specifying the bit-width for each layer. Then, we set the first and last convolutional layers to $8$ bits, and we pre-compute and cache the time embedding and projection layers.

Algorithm 1 Mixed-Precision Algorithm

Input: Candidate bits set $b\in\{1,2,3\}$, SD-v1.5 model $D$, number of total layers $L$ (except for the time embedding, time projection, the first and last convolutional layers), dataset $X$, training iterations $T$, number of evaluation images $K$, threshold $\mathcal{S}_o$, parameter size factor $\eta$, number of parameters of the $i$-th layer $N_i$.

Output: Mixed-precision recipe $\{b_i^*\}$, $i=1,2,\cdots,L$.

Stage 1: Obtaining the quantized models.
for $b=1$ to $3$ do
   for $i=1$ to $L$ do
      Quantize the $i$-th layer to $b$ bits via Eq. (1) and the proposed initialization methods in Sec. 4.1 to get model $D_{i,b}$;
      for $t=1$ to $T$ do
         Update weights and scaling factors by minimizing the quantization error using quantization-aware training on $D_{i,b}$ with data $X$;
      end for
   end for
end for
Stage 2: Calculating quantization error metrics.
Generate $K$ images $I_d$ via SD-v1.5;
for $b=1$ to $3$ do
   for $i=1$ to $L$ do
      Generate $K$ images $I_{i,b}$ via quantized model $D_{i,b}$;
      Calculate MSE, $M_{i,b}$, via images $I_{i,b}$ and $I_d$;
      Calculate PSNR, $P_{i,b}$, via images $I_{i,b}$ and $I_d$;
      Calculate LPIPS, $L_{i,b}$, via images $I_{i,b}$ and $I_d$;
      Calculate CLIP score drop, $C_{i,b}$, via images $I_{i,b}$ and prompts;
   end for
end for
Stage 3: Deciding the optimal precision.
Calculate sensitivity score $\mathcal{S}_{i,b}=M_{i,b}N_i^{-\eta}$;
for $i=1$ to $L$ do
   $b_i^*\leftarrow 4$;
   for $b=3$ to $1$ do
      if $\mathcal{S}_{i,b}<\mathcal{S}_o$ then
         Assign the $i$-th layer $b$ bits with $b_i^*\leftarrow b$;
      end if
   end for
end for
Calculate CLIP score drop, $C_{i,3}$, and its $p$-th percentile $C_p$;
for $i=1$ to $L$ do
   if $C_{i,3}>C_{90}$ then
      $b_i^*\leftarrow b_i^*+1$;
   end if
   if $C_{i,3}>C_{95}$ then
      $b_i^*\leftarrow b_i^*+1$;
   end if
   if $C_{i,3}>C_{98}$ then
      $b_i^*\leftarrow b_i^*+1$;
   end if
end for
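
The third stage of Alg. 1 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `mixed_precision_recipe` and the array layout are assumptions, and the per-layer MSE and 3-bit CLIP score drops are taken as given inputs from the second stage. Percentiles use NumPy's default linear interpolation, which the paper does not specify.

```python
import numpy as np


def mixed_precision_recipe(mse, num_params, clip_drop_3bit, s_o, eta):
    """Return the per-layer bit-widths b_i^* (hypothetical helper).

    mse:            shape (L, 3); mse[i, b - 1] is M_{i,b} for b in {1, 2, 3}
    num_params:     shape (L,);   N_i
    clip_drop_3bit: shape (L,);   C_{i,3}
    s_o:            sensitivity threshold S_o
    eta:            parameter size factor in [0, 1]
    """
    mse = np.asarray(mse, dtype=np.float64)
    n = np.asarray(num_params, dtype=np.float64)
    clip_drop = np.asarray(clip_drop_3bit, dtype=np.float64)

    # Sensitivity score S_{i,b} = M_{i,b} * N_i^{-eta}.
    score = mse * (n ** -eta)[:, None]

    # Default to 4 bits, then sweep b = 3, 2, 1 as in Alg. 1, overwriting
    # whenever the score for b bits is below the threshold.
    bits = np.full(len(n), 4, dtype=int)
    for b in (3, 2, 1):
        bits[score[:, b - 1] < s_o] = b

    # Refinement: layers above the 90th, 95th, and 98th percentile of the
    # 3-bit CLIP score drop gain one extra bit per threshold exceeded.
    for p in (90, 95, 98):
        bits[clip_drop > np.percentile(clip_drop, p)] += 1
    return bits
```

Because the three percentile checks are cumulative, a layer in the top $2\%$ passes all of them and gains 3 bits, which matches the "1, 2, 3 more bits" rule in the text.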

Appendix C More Details for Time Embedding Pre-computing and Caching

In Sec. 4.1, we introduce "Time Embedding Pre-computing and Caching". Here, we provide more details for the algorithm. In the Stable Diffusion model, the time step $t\in\{0,1,\cdots,999\}$ is transformed into a time embedding $\texttt{emb}_t$ through the equation $\texttt{emb}_t=e(t)$, where $e(\cdot)$ denotes the time embedding layer and $\texttt{emb}_t\in\mathbb{R}^{d_{te}}$. In SD-v1.5, $d_{te}=1280$. Then, for each ResBlock, denoted as $R_i$ for $i=1,2,\cdots,N_r$, where $N_r$ is the total number of ResBlocks with time projection layers, $\texttt{emb}_t$ is encoded by the time projection layer $r_i(\cdot)$ as $\texttt{F}_{i,t}=r_i(\texttt{emb}_t)$. Notice that $r_i(\cdot)$ and $e(\cdot)$ are both linear layers.
Finally, $\texttt{F}_{i,t}$ is applied to the intermediate activations of each $R_i$ via an addition operation, effectively incorporating temporal information into the Stable Diffusion model.

As observed before [27], time embedding and projection layers exhibit considerable sensitivity to quantization during PTQ on DM. To address this problem, existing work specifically pays attention to reconstructing layers related to time embedding [27]. In this study, we propose a more effective method. We observe that 1) during the inference stage, for each time step $t$, $\texttt{emb}_t$ and consequently $\texttt{F}_{i,t}$ remain constant; and 2) in the Stable Diffusion model, $\texttt{F}_{i,t}$ is considerably smaller than the time embedding and projection layers. Specifically, in SD-v1.5, $\texttt{F}_{i,t}$ has a dimension in $\{320,640,1280\}$, which is much smaller than the time projection layers $W_r\in\mathbb{R}^{D\times 1280}$, where $D\in\{320,640,1280\}$. Therefore, we introduce an efficient and lossless method named Time Embedding Pre-computing and Caching. Specifically, for a total of $T_{\texttt{inf}}$ inference time steps, we opt to store only $T_{\texttt{inf}}$ time features, rather than retaining the original time embedding layers $e(\cdot)$ and the time projection layers $r_i(\cdot)$ in the $i$-th ResBlock.

The number of inference time steps is set to 50 or fewer in most Stable Diffusion models. This method therefore reduces the storage requirements of the time-related layers by more than $1280/50=25.6\times$, and it removes their computational cost at inference, since the time features are looked up rather than computed. Given that the storage size of the pre-computed $\texttt{F}_{i,t}$ is substantially smaller than that of the original linear layers, this approach effectively diminishes the average bit of our quantized model without any performance degradation.
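
The caching idea and the $1280/50$ ratio can be illustrated with a small NumPy sketch. Everything below is a toy stand-in under stated assumptions: `build_time_feature_cache`, `embed`, and the random matrices are hypothetical, the three projection widths are one layer per dimension in $\{320,640,1280\}$ rather than the real SD-v1.5 layer list, and the evenly spaced schedule of 50 time steps is only an example.

```python
import numpy as np


def build_time_feature_cache(embed, projections, timesteps):
    """Pre-compute F_{i,t} = r_i(e(t)) for every inference time step."""
    cache = {}
    for t in timesteps:
        emb_t = embed(t)  # emb_t = e(t), shape (d_te,)
        for i, proj in enumerate(projections):
            cache[(i, t)] = proj(emb_t)  # F_{i,t}, shape (D_i,)
    return cache


rng = np.random.default_rng(0)
d_te = 1280

# Toy time embedding: a fixed random map applied to a few sinusoids of t.
W_e = rng.standard_normal((d_te, 4))


def embed(t):
    feats = np.array([np.sin(t), np.cos(t), np.sin(t / 100.0), np.cos(t / 100.0)])
    return W_e @ feats


# Toy time projection weights W_r with shape (D, 1280), D in {320, 640, 1280}.
W_r = [rng.standard_normal((d, d_te)) for d in (320, 640, 1280)]
projections = [lambda x, W=W: W @ x for W in W_r]

timesteps = list(range(0, 1000, 20))  # 50 inference time steps (assumed schedule)
cache = build_time_feature_cache(embed, projections, timesteps)

cached_values = sum(v.size for v in cache.values())  # 50 * (320 + 640 + 1280)
layer_values = sum(W.size for W in W_r)              # 1280 * (320 + 640 + 1280)
ratio = layer_values / cached_values                 # 1280 / 50
```

Each projection stores $D\times 1280$ weights but only $D\times T_{\texttt{inf}}$ cached features, so the ratio is $1280/T_{\texttt{inf}}$ regardless of $D$. Dropping the embedding layer $e(\cdot)$ as well only increases the saving, which is consistent with the "more than" in the text. At inference, a ResBlock would read `cache[(i, t)]` in place of calling the two layers.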

Appendix D Analysis of Symmetric Weight Distribution

In Sec. 4.1, we introduce "Adding Balance Integer" by assuming the weight distribution in Stable Diffusion is symmetric. Here, we provide more analysis for the assumption. To verify that the weight distribution is symmetric around zero in SD-v1.5, we measure the skewness of the weight distribution of each layer. Skewness closer to zero indicates a more symmetric weight distribution. As illustrated in Fig. 6, $97\%$ of layers exhibit skewness within $[-0.5, 0.5]$, which suggests that most layers in SD-v1.5 have symmetric weight distributions.

Figure 6: Skewness of the weight distribution of each layer in SD-v1.5. Skewness closer to zero indicates a more symmetric weight distribution. $97\%$ of layers have skewness within $[-0.5, 0.5]$, indicating that most layers in SD-v1.5 have symmetric weight distributions.
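
A per-layer skewness check of this kind can be sketched as below. The appendix does not state which skewness estimator is used, so this sketch assumes the standard moment-based (Fisher-Pearson) coefficient; the helper names `skewness` and `fraction_symmetric` are hypothetical.

```python
import numpy as np


def skewness(weights) -> float:
    """Moment-based skewness: E[(w - mean)^3] / std^3 (assumed estimator)."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    centered = w - w.mean()
    std = centered.std()
    if std == 0.0:
        return 0.0  # a constant tensor is treated as symmetric
    return float(np.mean(centered ** 3) / std ** 3)


def fraction_symmetric(layers, bound: float = 0.5) -> float:
    """Fraction of layers whose skewness lies within [-bound, bound]."""
    values = [skewness(w) for w in layers]
    return sum(abs(s) <= bound for s in values) / len(values)
```

Applied to the flattened weight tensor of every layer, `fraction_symmetric` would give the kind of summary reported in Fig. 6.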

Appendix E More Details for Quantization Error Across Different Time Steps

In Sec. 4.2, we introduce the "Quantization Error-aware Time Step Sampling" method. Here, we provide more details for measuring the quantization error from the predicted latent instead of the predicted noise. During the inference stage, the actual denoising step requires the scaling operation on the predicted noise in diffusion models. Therefore, directly calculating the quantization error via noise prediction is not accurate. Instead, we calculate the quantization error in the latent feature space. We derive the relationship of quantization error calculated from the predicted latent and noise as follows:

\begin{align}
E &= \mathbb{E}_{t,\mathbf{x}}\left[\left\|\hat{\mathbf{z}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_t,\mathbf{c})-\hat{\mathbf{z}}_{\boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_t,\mathbf{c})\right\|^2\right] \tag{9}\\
&= \mathbb{E}_{t,\mathbf{x}}\left[\left\|\left(\frac{1}{\sqrt{\bar{\alpha}_t}}\mathbf{z}_t-\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_t,\mathbf{c})\right)-\left(\frac{1}{\sqrt{\bar{\alpha}_t}}\mathbf{z}_t-\frac{\sqrt{1-\bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_t,\mathbf{c})\right)\right\|^2\right] \notag\\
&= \mathbb{E}_{t,\mathbf{x}}\left[\frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}\left\|\hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_t,\mathbf{c})-\hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_t,\mathbf{c})\right\|^2\right], \notag
\end{align}

where $\bar{\alpha}_t$ is the noise scheduler in [21].
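
The last equality in Eq. (9) can be checked numerically for a single sample. The sketch below uses random vectors as stand-ins for the latent and the two noise predictions; the value of `alpha_bar` and the helper `predict_latent` are illustrative assumptions, not quantities from the paper.

```python
import numpy as np


def predict_latent(z_t, eps, alpha_bar):
    """z_hat = z_t / sqrt(alpha_bar) - sqrt(1 - alpha_bar) / sqrt(alpha_bar) * eps."""
    return (z_t - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)


rng = np.random.default_rng(0)
alpha_bar = 0.3                                    # example value of alpha_bar_t
z_t = rng.standard_normal(16)                      # noisy latent z_t
eps_fp = rng.standard_normal(16)                   # full-precision noise prediction
eps_q = eps_fp + 0.01 * rng.standard_normal(16)    # quantized-model noise prediction

# Error measured in the latent space (first line of Eq. (9)).
latent_err = np.sum(
    (predict_latent(z_t, eps_fp, alpha_bar) - predict_latent(z_t, eps_q, alpha_bar)) ** 2
)

# Error measured on the noise, rescaled by (1 - alpha_bar) / alpha_bar (last line).
noise_err = np.sum((eps_fp - eps_q) ** 2)
scaled_noise_err = (1.0 - alpha_bar) / alpha_bar * noise_err
```

The two quantities agree up to floating-point error because the $\mathbf{z}_t$ terms cancel. The factor $(1-\bar{\alpha}_t)/\bar{\alpha}_t$ grows as $\bar{\alpha}_t$ shrinks, so the same noise-prediction error counts for more in the latent space at time steps with small $\bar{\alpha}_t$.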

Appendix F Detailed Metrics for Quantization Error by Quantizing Different Layers

In Sec. 3.2, we calculate the various metrics for representing the quantization error when quantizing different layers. Here, we provide detailed metrics when quantizing each layer of SD-v1.5 to 1, 2, and 3 bits.

(a) MSE value caused by the 1-bit quantized layers in SD-v1.5.
(b) MSE value caused by the 2-bit quantized layers in SD-v1.5.
(c) MSE value caused by the 3-bit quantized layers in SD-v1.5.
Figure 7: MSE value caused by the quantized layers in SD-v1.5.
(a) CLIP score degradation caused by the 1-bit quantized layers in SD-v1.5.
(b) CLIP score degradation caused by the 2-bit quantized layers in SD-v1.5.
(c) CLIP score degradation caused by the 3-bit quantized layers in SD-v1.5.
Figure 8: CLIP score degradation caused by quantized layers in SD-v1.5.
(a) LPIPS value of the 1-bit quantized layers in SD-v1.5.
(b) LPIPS value of the 2-bit quantized layers in SD-v1.5.
(c) LPIPS value of the 3-bit quantized layers in SD-v1.5.
Figure 9: LPIPS value of quantized layers in SD-v1.5.
(a) PSNR value of the 1-bit quantized layers in SD-v1.5.
(b) PSNR value of the 2-bit quantized layers in SD-v1.5.
(c) PSNR value of the 3-bit quantized layers in SD-v1.5.
Figure 10: PSNR value of quantized layers in SD-v1.5.

Appendix G More Visualization for Quantization Error by Quantizing Different Layers

In Sec. 3.2, we show the images for representing the quantization error when quantizing different layers. Here, we provide more visualization for demonstrating the different quantization errors caused by quantizing different layers to 1 bit. The quantized layers from left to right correspond to the annotated layers at the bottom: SD-v1.5 w/o quantization, Down.0.0.attn2.toq, Down.0.0.attn2.tok, Down.0.0.attn2.tov, Down.1.0.attn2.tok, Down.1.1.attn2.tok, Down.2.res.0.conv1, Up.2.res.2.convshortcut, Up.3.2.attn2.tok, Up.3.res.2.convshortcut.

Figure 11: Quantization errors demonstrated in generated images (via PartiPrompts) after performing 1-bit quantization on different individual layers.

Appendix H 1.99 Bits Mixed Precision Recipe

We provide the 1.99-bit recipe used in our experiments. During the training and inference stages, we add a balancing integer to the $n$-bit values, resulting in $\texttt{log}(2^n+1)$ bits. We calculate the average bits by $\frac{\sum_i \texttt{log}(2^{b_i^*}+1)\times N_i+16\times N_{tf}}{N_w}$, where $b_i^*$ is the calculated bit-width of the $i$-th layer, $N_i$ is the number of weights of the $i$-th layer, $N_{tf}$ is the number of parameters for pre-cached time features, and $N_w$ is the total number of weights in linear and convolutional layers. We calculate the model size by counting all other parameters as 32 bits. The index and name of each layer are listed:

1 down_blocks.0.attentions.0.proj_in: 6
2 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q: 5
3 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k: 5
4 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v: 4
5 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0: 6
6 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q: 5
7 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k: 7
8 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v: 3
9 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
10 down_blocks.0.attentions.0.transformer_blocks.0.ff.net.0.proj: 3
11 down_blocks.0.attentions.0.transformer_blocks.0.ff.net.2: 3
12 down_blocks.0.attentions.0.proj_out: 5
13 down_blocks.0.attentions.1.proj_in: 4
14 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_q: 3
15 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_k: 4
16 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_v: 6
17 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_out.0: 5
18 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_q: 5
19 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_k: 7
20 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_v: 2
21 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0: 3
22 down_blocks.0.attentions.1.transformer_blocks.0.ff.net.0.proj: 3
23 down_blocks.0.attentions.1.transformer_blocks.0.ff.net.2: 3
24 down_blocks.0.attentions.1.proj_out: 6
25 down_blocks.0.resnets.0.conv1: 3
26 down_blocks.0.resnets.0.conv2: 3
27 down_blocks.0.resnets.1.conv1: 3
28 down_blocks.0.resnets.1.conv2: 4
29 down_blocks.0.downsamplers.0.conv: 4
30 down_blocks.1.attentions.0.proj_in: 4
31 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q: 3
32 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k: 3
33 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v: 4
34 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0: 4
35 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q: 3
36 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k: 5
37 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v: 4
38 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
39 down_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
40 down_blocks.1.attentions.0.transformer_blocks.0.ff.net.2: 2
41 down_blocks.1.attentions.0.proj_out: 4
42 down_blocks.1.attentions.1.proj_in: 4
43 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q: 2
44 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k: 2
45 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v: 4
46 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0: 4
47 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q: 3
48 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k: 6
49 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v: 4
50 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0: 3
51 down_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj: 2
52 down_blocks.1.attentions.1.transformer_blocks.0.ff.net.2: 2
53 down_blocks.1.attentions.1.proj_out: 4
54 down_blocks.1.resnets.0.conv1: 3
55 down_blocks.1.resnets.0.conv2: 3
56 down_blocks.1.resnets.0.conv_shortcut: 7
57 down_blocks.1.resnets.1.conv1: 3
58 down_blocks.1.resnets.1.conv2: 2
59 down_blocks.1.downsamplers.0.conv: 4
60 down_blocks.2.attentions.0.proj_in: 3
61 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_q: 3
62 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_k: 2
63 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_v: 3
64 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
65 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q: 3
66 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k: 4
67 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v: 4
68 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
69 down_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
70 down_blocks.2.attentions.0.transformer_blocks.0.ff.net.2: 1
71 down_blocks.2.attentions.0.proj_out: 3
72 down_blocks.2.attentions.1.proj_in: 4
73 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_q: 4
74 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_k: 2
75 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_v: 3
76 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0: 3
77 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q: 3
78 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k: 4
79 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v: 4
80 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0: 3
81 down_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj: 2
82 down_blocks.2.attentions.1.transformer_blocks.0.ff.net.2: 2
83 down_blocks.2.attentions.1.proj_out: 4
84 down_blocks.2.resnets.0.conv1: 3
85 down_blocks.2.resnets.0.conv2: 2
86 down_blocks.2.resnets.0.conv_shortcut: 4
87 down_blocks.2.resnets.1.conv1: 2
88 down_blocks.2.resnets.1.conv2: 1
89 down_blocks.2.downsamplers.0.conv: 1
90 down_blocks.3.resnets.0.conv1: 1
91 down_blocks.3.resnets.0.conv2: 1
92 down_blocks.3.resnets.1.conv1: 1
93 down_blocks.3.resnets.1.conv2: 1
94 up_blocks.0.resnets.0.conv1: 1
95 up_blocks.0.resnets.0.conv2: 2
96 up_blocks.0.resnets.0.conv_shortcut: 1
97 up_blocks.0.resnets.1.conv1: 1
98 up_blocks.0.resnets.1.conv2: 1
99 up_blocks.0.resnets.1.conv_shortcut: 1
100 up_blocks.0.resnets.2.conv1: 2
101 up_blocks.0.resnets.2.conv2: 1
102 up_blocks.0.resnets.2.conv_shortcut: 1
103 up_blocks.0.upsamplers.0.conv: 1
104 up_blocks.1.attentions.0.proj_in: 3
105 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q: 2
106 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k: 1
107 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v: 2
108 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
109 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q: 3
110 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k: 4
111 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v: 4
112 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0: 2
113 up_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
114 up_blocks.1.attentions.0.transformer_blocks.0.ff.net.2: 2
115 up_blocks.1.attentions.0.proj_out: 3
116 up_blocks.1.attentions.1.proj_in: 3
117 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q: 2
118 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k: 2
119 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v: 2
120 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0: 2
121 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q: 2
122 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k: 4
123 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v: 3
124 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0: 1
125 up_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj: 1
126 up_blocks.1.attentions.1.transformer_blocks.0.ff.net.2: 1
127 up_blocks.1.attentions.1.proj_out: 3
128 up_blocks.1.attentions.2.proj_in: 3
129 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_q: 1
130 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_k: 1
131 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_v: 2
132 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_out.0: 2
133 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_q: 1
134 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_k: 3
135 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_v: 2
136 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0: 1
137 up_blocks.1.attentions.2.transformer_blocks.0.ff.net.0.proj: 1
138 up_blocks.1.attentions.2.transformer_blocks.0.ff.net.2: 1
139 up_blocks.1.attentions.2.proj_out: 2
140 up_blocks.1.resnets.0.conv1: 1
141 up_blocks.1.resnets.0.conv2: 1
142 up_blocks.1.resnets.0.conv_shortcut: 3
143 up_blocks.1.resnets.1.conv1: 1
144 up_blocks.1.resnets.1.conv2: 1
145 up_blocks.1.resnets.1.conv_shortcut: 3
146 up_blocks.1.resnets.2.conv1: 1
147 up_blocks.1.resnets.2.conv2: 1
148 up_blocks.1.resnets.2.conv_shortcut: 3
149 up_blocks.1.upsamplers.0.conv: 2
150 up_blocks.2.attentions.0.proj_in: 4
151 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_q: 2
152 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_k: 2
153 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_v: 3
154 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
155 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q: 1
156 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k: 2
157 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v: 1
158 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0: 1
159 up_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj: 1
160 up_blocks.2.attentions.0.transformer_blocks.0.ff.net.2: 1
161 up_blocks.2.attentions.0.proj_out: 3
162 up_blocks.2.attentions.1.proj_in: 4
163 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_q: 2
164 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_k: 3
165 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_v: 3
166 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0: 3
167 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q: 1
168 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k: 3
169 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v: 1
170 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0: 1
171 up_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj: 1
172 up_blocks.2.attentions.1.transformer_blocks.0.ff.net.2: 1
173 up_blocks.2.attentions.1.proj_out: 3
174 up_blocks.2.attentions.2.proj_in: 4
175 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_q: 2
176 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_k: 2
177 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_v: 2
178 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_out.0: 3
179 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_q: 2
180 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_k: 3
181 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_v: 1
182 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0: 1
183 up_blocks.2.attentions.2.transformer_blocks.0.ff.net.0.proj: 1
184 up_blocks.2.attentions.2.transformer_blocks.0.ff.net.2: 1
185 up_blocks.2.attentions.2.proj_out: 3
186 up_blocks.2.resnets.0.conv1: 1
187 up_blocks.2.resnets.0.conv2: 2
188 up_blocks.2.resnets.0.conv_shortcut: 4
189 up_blocks.2.resnets.1.conv1: 1
190 up_blocks.2.resnets.1.conv2: 2
191 up_blocks.2.resnets.1.conv_shortcut: 4
192 up_blocks.2.resnets.2.conv1: 1
193 up_blocks.2.resnets.2.conv2: 1
194 up_blocks.2.resnets.2.conv_shortcut: 4
195 up_blocks.2.upsamplers.0.conv: 3
196 up_blocks.3.attentions.0.proj_in: 4
197 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_q: 2
198 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_k: 2
199 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_v: 6
200 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
201 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_q: 2
202 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k: 3
203 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_v: 1
204 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0: 1
205 up_blocks.3.attentions.0.transformer_blocks.0.ff.net.0.proj: 1
206 up_blocks.3.attentions.0.transformer_blocks.0.ff.net.2: 1
207 up_blocks.3.attentions.0.proj_out: 4
208 up_blocks.3.attentions.1.proj_in: 4
209 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_q: 2
210 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_k: 3
211 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_v: 5
212 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_out.0: 3
213 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_q: 2
214 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_k: 3
215 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_v: 1
216 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0: 1
217 up_blocks.3.attentions.1.transformer_blocks.0.ff.net.0.proj: 2
218 up_blocks.3.attentions.1.transformer_blocks.0.ff.net.2: 1
219 up_blocks.3.attentions.1.proj_out: 4
220 up_blocks.3.attentions.2.proj_in: 6
221 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q: 2
222 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k: 3
223 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v: 4
224 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0: 3
225 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q: 4
226 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k: 5
227 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v: 1
228 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0: 1
229 up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj: 3
230 up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2: 2
231 up_blocks.3.attentions.2.proj_out: 4
232 up_blocks.3.resnets.0.conv1: 1
233 up_blocks.3.resnets.0.conv2: 2
234 up_blocks.3.resnets.0.conv_shortcut: 4
235 up_blocks.3.resnets.1.conv1: 2
236 up_blocks.3.resnets.1.conv2: 2
237 up_blocks.3.resnets.1.conv_shortcut: 4
238 up_blocks.3.resnets.2.conv1: 2
239 up_blocks.3.resnets.2.conv2: 2
240 up_blocks.3.resnets.2.conv_shortcut: 4
241 mid_block.attentions.0.proj_in: 2
242 mid_block.attentions.0.transformer_blocks.0.attn1.to_q: 3
243 mid_block.attentions.0.transformer_blocks.0.attn1.to_k: 1
244 mid_block.attentions.0.transformer_blocks.0.attn1.to_v: 2
245 mid_block.attentions.0.transformer_blocks.0.attn1.to_out.0: 2
246 mid_block.attentions.0.transformer_blocks.0.attn2.to_q: 1
247 mid_block.attentions.0.transformer_blocks.0.attn2.to_k: 4
248 mid_block.attentions.0.transformer_blocks.0.attn2.to_v: 4
249 mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
250 mid_block.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
251 mid_block.attentions.0.transformer_blocks.0.ff.net.2: 1
252 mid_block.attentions.0.proj_out: 3
253 mid_block.resnets.0.conv1: 1
254 mid_block.resnets.0.conv2: 1
255 mid_block.resnets.1.conv1: 1
256 mid_block.resnets.1.conv2: 1
conv_in: 8
conv_out: 8
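
The average-bits formula at the start of this appendix can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the logarithm is base 2 (so that the result is in bits), takes $N_w$ to be the sum of the weights passed in, and uses made-up layer sizes rather than the real UNet shapes.

```python
import math


def effective_bits(n_bits: int) -> float:
    """Bits per weight after the balancing integer adds one level: log2(2^n + 1).

    Assumption: the log in the text is base 2.
    """
    return math.log2(2 ** n_bits + 1)


def average_bits(layers, n_time_features: int = 0) -> float:
    """Average bits per weight for (bit_width, num_weights) pairs.

    n_time_features is N_tf, the pre-cached time-feature parameters,
    each counted at 16 bits as in the formula.
    """
    layers = list(layers)
    total_weights = sum(n for _, n in layers)  # N_w (assumed: all listed weights)
    quantized_bits = sum(effective_bits(b) * n for b, n in layers)
    return (quantized_bits + 16 * n_time_features) / total_weights


# Hypothetical layer sizes, for illustration only.
toy_layers = [(1, 1_000_000), (2, 500_000), (4, 100_000)]
print(round(average_bits(toy_layers), 3))
```

Under this reading, a layer assigned 1 bit costs about 1.585 bits per weight, which is why a recipe dominated by 1-bit and 2-bit layers can still average close to 2 bits.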

Appendix I Details for Evaluation Metrics

In Sec. 5, we measure the performance on various metrics such as TIFA, GenEval, CLIP score and FID. Here, we provide more details for these metrics.

TIFA Score. TIFA v1.0 [26] aims to measure the faithfulness of generated images. It includes 4K diverse text prompts sampled from the MS-COCO captions [44], DrawBench [64], PartiPrompts [77], and PaintSkill [6], associated with a pre-generated set of question-answer pairs, resulting in 25K questions covering 4.5K diverse elements. Image faithfulness is measured by determining whether a VQA model can accurately answer the questions from the generated images.

GenEval Score. GenEval [14] measures the consistency between the generated images and the description, including 6 different tasks: single object, two object, counting, colors, position, color attribution. All text prompts are generated from task-specific templates filled in with: randomly sampled object names from MS-COCO [44], colors from Berlin-Kay basic color theory, numbers with 2, 3, 4, and relative positions from "above", "below", "to the left of", or "to the right of". We adopt the pre-trained object detection model Mask2Former (Swin-S-8×2) [5] for evaluation.

CLIP score and FID. CLIP score measures the similarity between text prompts and the corresponding generated images. FID evaluates the quality of generated images by measuring the distance between the distributions of features extracted from generated images and target images. In the main experiments, both metrics are computed on the MS-COCO 2014 validation set with 30K image-caption pairs [44]. We adopt the ViT-B/32 model to evaluate the CLIP score in our experiments.
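
As a rough illustration of the CLIP score, the sketch below averages the cosine similarity between paired text and image embeddings. It assumes the embeddings have already been produced by a CLIP encoder such as ViT-B/32; the toy vectors are placeholders, and some implementations additionally rescale or clip the similarity, which is not modeled here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(text_embeddings, image_embeddings):
    """Mean cosine similarity over paired (text, image) embeddings."""
    sims = [cosine_similarity(t, i) for t, i in zip(text_embeddings, image_embeddings)]
    return sum(sims) / len(sims)


# Placeholder 3-d vectors standing in for real CLIP embeddings.
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
images = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(clip_score(texts, images))
```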

Appendix J Human Evaluation

In Sec. 5, we provide the human evaluation results. Here, we provide a more detailed human evaluation, with per-category and per-challenge comparisons on PartiPrompts (P2) between Stable Diffusion v1.5 and BitsFusion. The evaluation question is: Given a prompt, which image has better aesthetics and image-text alignment? Our model is selected 888 times out of 1632 comparisons, while SD-v1.5 is chosen 744 times, indicating a general preference for BitsFusion and its more appealing and accurate generated images.

Refer to caption
Figure 12: Overall human evaluation comparisons between SD-v1.5 and BitsFusion. Notably, BitsFusion is favored 54.41% of the time over SD-v1.5.
Refer to caption
Figure 13: Human evaluation across particular categories.
Refer to caption
Figure 14: Human evaluation across particular challenges.

J.1 Analysis on Categories

Illustrations, People, and Arts. Our model significantly outperforms SD-v1.5 in generating illustrations (77 wins out of 124), images of people (101 out of 174), and arts (45 out of 65).

Outdoor and Indoor Scenes. Our model also shows strength in generating both outdoor (73 out of 131) and indoor scenes (23 out of 40), suggesting better environmental rendering capabilities.

J.2 Analysis on Challenges

Complex and Fine-grained Detail: Our model excels in generating images with complex details (73 out of 113) and fine-grained details (173 out of 312), suggesting advanced capabilities in maintaining detail at varying complexity levels.

Imagination and Style & Format: Our model also shows strong performance in tasks requiring imagination (92 out of 149) and stylistic diversity (118 out of 204), highlighting its flexibility and creative handling of artistic elements.

The strong performance in imaginative and artistic categories presents an opportunity to target applications in creative industries, such as digital art and entertainment, where these capabilities can be particularly valuable.
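
The win counts reported in this appendix can be converted into win rates for easier comparison. The snippet below is a small sketch that only re-expresses the numbers quoted above; the dictionary keys are our own shorthand for the categories and challenges.

```python
# (wins for BitsFusion, total comparisons), copied from Appendix J.
WIN_COUNTS = {
    "Overall": (888, 1632),
    "Illustrations": (77, 124),
    "People": (101, 174),
    "Arts": (45, 65),
    "Outdoor Scenes": (73, 131),
    "Indoor Scenes": (23, 40),
    "Complex": (73, 113),
    "Fine-grained Detail": (173, 312),
    "Imagination": (92, 149),
    "Style & Format": (118, 204),
}


def win_rates(counts):
    """Fraction of comparisons won by BitsFusion for each entry."""
    return {name: wins / total for name, (wins, total) in counts.items()}


rates = win_rates(WIN_COUNTS)
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{rate:.2%}")
```

The overall entry reproduces the 54.41% preference shown in Fig. 12, and every listed category and challenge has a win rate above 50%.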

Appendix K Evaluation on Different Schedulers

In the main experiments in Sec. 5, we leverage the PNDM scheduler to generate images. Here, we measure the performance with different schedulers, such as DDIM [70] and DPMSolver [50], to demonstrate the generality and effectiveness of BitsFusion. We use 50 inference steps and fix the random seed to 1024. As shown in Fig. 15, BitsFusion consistently outperforms SD-v1.5 with different schedulers.

Refer to caption Refer to caption
(a) DDIM (b) DPMSolver
Figure 15: TIFA score comparisons between SD-v1.5 and BitsFusion with different schedulers. Left: TIFA scores measured with the DDIM [70] scheduler. Right: TIFA scores measured with the DPMSolver [50] scheduler.
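
For readers who want to reproduce the scheduler comparison on the full-precision baseline, the following sketch shows one way to swap schedulers with the Hugging Face `diffusers` library while keeping 50 steps and seed 1024. It is an illustration under assumptions: the mapping from the samplers named in the text to `diffusers` classes is ours, `model_id` must be supplied by the reader, and loading the quantized BitsFusion weights is not covered.

```python
# Assumed mapping from the samplers in the text to diffusers scheduler classes.
SCHEDULER_CLASSES = {
    "pndm": "PNDMScheduler",
    "ddim": "DDIMScheduler",
    "dpmsolver": "DPMSolverMultistepScheduler",
}


def generate(model_id, prompt, scheduler="pndm", steps=50, seed=1024):
    """Generate one image with the chosen scheduler, fixed steps, and fixed seed."""
    # Imported lazily so the heavy dependencies are only needed when generating.
    import torch
    import diffusers

    pipe = diffusers.StableDiffusionPipeline.from_pretrained(model_id)
    scheduler_cls = getattr(diffusers, SCHEDULER_CLASSES[scheduler])
    # Reuse the pipeline's scheduler config so only the sampling algorithm changes.
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    generator = torch.Generator().manual_seed(seed)
    return pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
```

Calling `generate(model_id, prompt, scheduler="ddim")` and `generate(model_id, prompt, scheduler="dpmsolver")` with the same prompt then gives images that differ only in the sampler. The guidance scale is left at the pipeline default here because the text does not state it for this experiment.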

Appendix L FID Results and Discussion

As stated in SDXL [55] and PickScore [35], FID may not faithfully reflect the actual performance of a model in practice. FID measures the distance between the feature distributions of generated images and reference real images, which is largely influenced by the training datasets. Moreover, FID does not capture human preference, which is crucial for evaluating text-to-image synthesis. Nevertheless, we present FID results evaluated on the 30K MS-COCO 2014 validation set in Fig. 16. Our Stage-I model has a similar FID to SD-v1.5. However, as training progresses, although our Stage-II model is preferred by users, its FID score is higher than that of both Stage-I and SD-v1.5.
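
For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of the reference and generated images. The notation below ($\mu_r, \Sigma_r$ for the reference set and $\mu_g, \Sigma_g$ for the generated set) is ours and does not appear in the paper:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Both terms depend only on feature statistics relative to the reference set, so a model whose outputs drift away from the MS-COCO image distribution can score a higher FID even when users prefer its images.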

Refer to caption
Figure 16: FID results evaluated on 30K MS-COCO 2014 validation set.

Appendix M Detailed GenEval Results

In Sec. 5, we provide the overall GenEval results. Here, we provide detailed GenEval results for further comparisons as illustrated in Tab. 8.

Table 8: Detailed GenEval results with different CFG (guidance) scales.

| Guidance Scale | Method | Overall | Single Object | Two Object | Counting | Colors | Position | Color Attribution |
|---|---|---|---|---|---|---|---|---|
| 2.5 | SD-v1.5 | 0.3589 | 0.9350 | 0.2626 | 0.2775 | 0.6043 | 0.0340 | 0.0400 |
| 2.5 | Ours-I | 0.3353 | 0.9075 | 0.2444 | 0.2550 | 0.5426 | 0.0280 | 0.0340 |
| 2.5 | Ours-II | 0.4024 | 0.8975 | 0.3859 | 0.2750 | 0.6979 | 0.0560 | 0.1020 |
| 3.5 | SD-v1.5 | 0.3879 | 0.9400 | 0.3010 | 0.3275 | 0.6787 | 0.0300 | 0.0500 |
| 3.5 | Ours-I | 0.3650 | 0.9500 | 0.2808 | 0.2575 | 0.6277 | 0.0280 | 0.0460 |
| 3.5 | Ours-II | 0.4370 | 0.9350 | 0.4727 | 0.3125 | 0.7340 | 0.0600 | 0.1080 |
| 4.5 | SD-v1.5 | 0.4056 | 0.9700 | 0.3010 | 0.3200 | 0.7426 | 0.0340 | 0.0660 |
| 4.5 | Ours-I | 0.3851 | 0.9500 | 0.3091 | 0.3100 | 0.6574 | 0.0340 | 0.0500 |
| 4.5 | Ours-II | 0.4516 | 0.9575 | 0.4788 | 0.3450 | 0.7723 | 0.0520 | 0.1040 |
| 5.5 | SD-v1.5 | 0.4094 | 0.9750 | 0.3111 | 0.3325 | 0.7319 | 0.0400 | 0.0660 |
| 5.5 | Ours-I | 0.4039 | 0.9675 | 0.3232 | 0.3425 | 0.7000 | 0.0300 | 0.0600 |
| 5.5 | Ours-II | 0.4567 | 0.9600 | 0.4909 | 0.3175 | 0.7979 | 0.0540 | 0.1200 |
| 6.5 | SD-v1.5 | 0.4224 | 0.9800 | 0.3293 | 0.3725 | 0.7447 | 0.0400 | 0.0680 |
| 6.5 | Ours-I | 0.4161 | 0.9675 | 0.3414 | 0.3350 | 0.7425 | 0.0360 | 0.0740 |
| 6.5 | Ours-II | 0.4612 | 0.9750 | 0.4990 | 0.3275 | 0.7957 | 0.0540 | 0.1160 |
| 7.5 | SD-v1.5 | 0.4262 | 0.9775 | 0.3313 | 0.3850 | 0.7596 | 0.0440 | 0.0600 |
| 7.5 | Ours-I | 0.4226 | 0.9775 | 0.3495 | 0.3600 | 0.7447 | 0.0360 | 0.0680 |
| 7.5 | Ours-II | 0.4682 | 0.9800 | 0.5091 | 0.3300 | 0.8085 | 0.0680 | 0.1140 |
| 8.5 | SD-v1.5 | 0.4271 | 0.9825 | 0.3273 | 0.3925 | 0.7745 | 0.0320 | 0.0540 |
| 8.5 | Ours-I | 0.4269 | 0.9800 | 0.3616 | 0.3475 | 0.7702 | 0.0400 | 0.0620 |
| 8.5 | Ours-II | 0.4747 | 0.9700 | 0.5111 | 0.3675 | 0.8213 | 0.0620 | 0.1160 |
| 9.5 | SD-v1.5 | 0.4260 | 0.9825 | 0.3556 | 0.3825 | 0.7553 | 0.0280 | 0.0520 |
| 9.5 | Ours-I | 0.4190 | 0.9825 | 0.3495 | 0.3450 | 0.7447 | 0.0300 | 0.0620 |
| 9.5 | Ours-II | 0.4736 | 0.9700 | 0.5192 | 0.3625 | 0.8277 | 0.0560 | 0.1060 |
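
The Overall column in Table 8 appears to be the unweighted mean of the six task scores. The short check below, using two rows copied from the table at guidance scale 2.5, illustrates this reading; it is our observation from the reported numbers rather than a statement from the paper.

```python
# Task order follows the table: single object, two object, counting,
# colors, position, color attribution.
def overall_score(task_scores):
    """Unweighted mean of the per-task GenEval scores."""
    return sum(task_scores) / len(task_scores)


# Rows copied from Table 8 at guidance scale 2.5.
sd15 = [0.9350, 0.2626, 0.2775, 0.6043, 0.0340, 0.0400]
ours_ii = [0.8975, 0.3859, 0.2750, 0.6979, 0.0560, 0.1020]

print(f"SD-v1.5: {overall_score(sd15):.4f}")     # 0.3589
print(f"Ours-II: {overall_score(ours_ii):.4f}")  # 0.4024
```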

Appendix N More Comparisons

We provide the prompts for the images featured in Fig. 1. Additionally, we provide more generated images for comparison.

N.1 Prompts

Prompts of Fig. 1 from left to right are:

1. a portrait of an anthropomorphic cyberpunk raccoon smoking a cigar, cyberpunk!,
fantasy, elegant, digital painting, artstation, concept art, matte, sharp focus,
illustration, art by josan Gonzalez

2. Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach
whirlpool engine, volumetric lighting, spectacular, ambient lights,
light pollution, cinematic atmosphere, art nouveau style,
illustration art artwork by SenseiJaye, intricate detail.

3. tropical island, 8 k, high resolution, detailed charcoal drawing,
beautiful hd, art nouveau, concept art, colourful, in the style of vadym meller

4. anthropomorphic art of a fox wearing a white suit, white cowboy hat,
and sunglasses, smoking a cigar, texas inspired clothing by artgerm,
victo ngai, ryohei hase, artstation. highly detailed digital painting,
smooth, global illumination, fantasy art by greg rutkowsky, karl spitzweg

5. a painting of a lantina elder woman by Leonardo da Vinci . details, smooth,
sharp focus, illustration, realistic, cinematic, artstation, award winning, rgb ,
unreal engine, octane render, cinematic light, macro, depth of field, blur,
red light and clouds from the back, highly detailed epic cinematic concept art CG
render made in Maya, Blender and Photoshop, octane render, excellent composition,
dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse.

6. panda mad scientist mixing sparkling chemicals, high-contrast painting

7. An astronaut riding a horse on the moon, oil painting by Van Gogh.

8. A red dragon dressed in a tuxedo and playing chess. The chess pieces are
fashioned after robots.

N.2 Additional Image Comparisons

We provide more images for further comparisons. For each set of two rows, the top row displays images generated using the full-precision Stable Diffusion v1.5, while the bottom row features images generated from BitsFusion, where the weights of UNet are quantized into 1.99 bits and the model size is 7.9× smaller than that of SD-v1.5. All the images are synthesized using the PNDM sampler with 50 sampling steps and a random seed of 1024.

Refer to caption
a b c d e f
Figure 17: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: A person standing on the desert, desert waves, gossip illustration, half red, half blue, abstract image of sand, clear style, trendy illustration, outdoor, top view, clear style, precision art, ultra high definition image; b: A detailed oil painting of an old sea captain, steering his ship through a storm. Saltwater is splashing against his weathered face, determination in his eyes. Twirling malevolent clouds are seen above and stern waves threaten to submerge the ship while seagulls dive and twirl through the chaotic landscape. Thunder and lights embark in the distance, illuminating the scene with an eerie green glow.; c: A solitary figure shrouded in mists peers up from the cobble stone street at the imposing and dark gothic buildings surrounding it. an old-fashioned lamp shines nearby. oil painting.; d: A deep forest clearing with a mirrored pond reflecting a galaxy-filled night sky; e: a handsome 24 years old boy in the middle with sky color background wearing eye glasses, it’s super detailed with anime style, it’s a portrait with delicated eyes and nice looking face; f: A dog that has been meditating all the time.
Refer to caption
a b c d e f
Figure 18: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: A small cactus with a happy face in the Sahara desert.; b: A middle-aged woman of Asian descent, her dark hair streaked with silver, appears fractured and splintered, intricately embedded within a sea of broken porcelain. The porcelain glistens with splatter paint patterns in a harmonious blend of glossy and matte blues, greens, oranges, and reds, capturing her dance in a surreal juxtaposition of movement and stillness. Her skin tone, a light hue like the porcelain, adds an almost mystical quality to her form.; c: A high contrast portrait photo of a fluffy hamster wearing an orange beanie and sunglasses holding a sign that says "Let’s PAINT!”; d: An extreme close-up of an gray-haired man with a beard in his 60s, he is deep in thought pondering the history of the universe as he sits at a cafe in Paris, his eyes focus on people offscreen as they walk as he sits mostly motionless, he is dressed in a wool coat suit coat with a button-down shirt , he wears a brown beret and glasses and has a very professorial appearance, and the end he offers a subtle closed-mouth smile as if he found the answer to the mystery of life, the lighting is very cinematic with the golden light and the Parisian streets and city in the background, depth of field, cinematic 35mm film.; e: poster of a mechanical cat, techical Schematics viewed from front and side view on light white blueprint paper, illustartion drafting style, illustation, typography, conceptual art, dark fantasy steampunk, cinematic, dark fantasy; f: I want to supplement vitamin c, please help me paint related food.
Refer to caption
a b c d e f
Figure 19: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: new cyborg with cybertronic gadgets and vr helmet, hard surface, beautiful colours, sharp textures, shiny shapes, acid screen, biotechnology, tim hildebrandt, bruce pennington, donato giancola, larry elmore, masterpiece, trending on artstation, featured on pixiv, cinematic composition, dramatic pose, beautiful lighting, sharp, details, hyper - detailed, hd, hdr, 4 k, 8 k; b: portrait of teenage aphrodite, light freckles, curly copper colored hair, smiling kindly, wearing an embroidered white linen dress with lace neckline, intricate, elegant, mother of pearl jewelry, glowing lights, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by wlop, mucha, artgerm, and greg Rutkowski; c: portrait of a dystopian cute dog wearing an outfit inspired by the handmaid ’ s tale ( 2 0 1 7 ), intricate, headshot, highly detailed, digital painting, artstation, concept art, sharp focus, cinematic lighting, digital painting, art by artgerm and greg rutkowski, alphonse mucha, cgsociety; d: Portrait of a man by Greg Rutkowski, symmetrical face, a marine with a helmet, using a VR Headset, Kubric Stare, crooked smile, he’s wearing a tacitcal gear, highly detailed portrait, scifi, digital painting, artstation, book cover, cyberpunk, concept art, smooth, sharp foccus ilustration, Artstation HQ; e: Film still of female Saul Goodman wearing a catmaid outfit, from Red Dead Redemption 2 (2018 video game), trending on artstation, artstationHD, artstationHQ; f: oil paining of robotic humanoid, intricate mechanisms, highly detailed, professional digital painting, Unreal Engine 5, Photorealism, HD quality, 8k resolution, cinema 4d, 3D, cinematic, professional photography, art by artgerm and greg rutkowski and alphonse mucha and loish and WLOP
Refer to caption
a b c d e f
Figure 20: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: anthropomorphic tetracontagon head in opal edgy darknimite mudskipper, intricate, elegant, highly detailed animal monster, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm, bob eggleton, michael whelan, stephen hickman, richard corben, wayne barlowe, trending on artstation and greg rutkowski and alphonse mucha, 8 k; b: background shows moon, many light effects, particle, lights, gems, symmetrical!!! centered portrait dark witch, large cloak, fantasy forest landscape, dragon scales, fantasy magic, undercut hairstyle, short purple black fade hair, dark light night, intricate, elegant, sharp focus, digital painting, concept art, matte, art by wlop and artgerm and greg rutkowski and alphonse mucha, masterpiece; c: cat seahorse fursona, autistic bisexual graphic designer and musician, long haired attractive androgynous fluffy humanoid character design, sharp focus, weirdcore voidpunk digital art by artgerm, akihiko yoshida, louis wain, simon stalenhag, wlop, noah bradley, furaffinity, artstation hd, trending on deviantart; d: concept art of ruins of a victorian city burning down by j. c. leyendecker, wlop, ruins, dramatic, octane render, epic painting, extremely detailed, 8 k; e: hyperrealistic Gerald Gallego as a killer clown from outer space, trending on artstation, portrait, sharp focus, illustration, art by artgerm and greg rutkowski and magali Villeneuve; f: low angle photo of a squirrel dj wearing on - ear headphones and colored sunglasses, stadning at a dj table playing techno music at a dance club, hyperrealistic, highly detailed, intricate, smoke, colored lights, concept art, digital art, oil painting, character design by charlie bowater, ross tran, artgerm, makoto shinkai, wlop
Refer to caption
a b c d e f
Figure 21: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: a photograph of an ostrich wearing a fedora and singing soulfully into a microphone; b: a pirate ship landing on the moon; c: a pumpkin with a candle in it; d: a rabbit wearing a black tophat and monocle; e: a red sports car on the road; f: a robot cooking in the kitchen.
Refer to caption
a b c d e f
Figure 22: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: a baby daikon radish in a tutu; b: a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants; c: a woman with long black hair and dark skin; d: an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants; e: a blue sports car on the road; f: a butterfly.
Refer to caption
a b c d e f
Figure 23: Top: Images generated from full-precision Stable Diffusion v1.5. Bottom: Images generated from BitsFusion. Prompts from left to right are: a: Helmet of a forgotten Deity, clowing corals, extremly detailed digital painting, in the style of Fenghua Zhong and Ruan Jia and jeremy lipking and Peter Mohrbacher, mystical colors, rim light, beautiful lighting, 8k, stunning scene, raytracing, octane, trending on artstation; b: Jeff Bezos as a female amazon warrior, closeup, D&D, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, hearthstone, art by Artgerm and Greg Rutkowski and Alphonse Mucha; c: Portrait of a draconic humanoid, HD, illustration, epic, D&D, fantasy, intricate, elegant, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, art by artgerm and greg rutkowski and alphonse mucha, monster hunter illustrations art book; d: [St.Georges slaying a car adorned with checkered flag. Soviet Propaganda!!! poster!!!, elegant, highly detailed, digital painting, artstation, concept art, matte, sharp focus, illustration, octane render, unreal engine, photography]; e: a fire - breathing dragon at a medieval hobbit home, ornate, beautiful, atmosphere, vibe, mist, smoke, chimney, rain, wet, pristine, puddles, waterfall, clear stream, bridge, forest, flowers, concept art illustration, color page, 4 k, tone mapping, doll, akihiko yoshida, james jean, andrei riabovitchev, marc simonetti, yoshitaka amano, digital illustration, greg rutowski, volumetric lighting, sunbeams, particles; f: portrait of a well-dressed raccoon, oil painting in the style of Rembrandt