BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

Yang Sui^1,2,† Yanyu Li¹ Anil Kag¹ Yerlan Idelbayev¹ Junli Cao¹ Ju Hu¹
Dhritiman Sagar¹ Bo Yuan² Sergey Tulyakov¹ Jian Ren^1,∗
¹Snap Inc. ²Rutgers University
Project Page: https://snap-research.github.io/BitsFusion

Abstract

Diffusion-based image generation models have achieved great success in recent years by showing the capability of synthesizing high-quality content. However, these models contain a huge number of parameters, resulting in a significantly large model size. Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this work, we develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to $1.99$ bits, achieving a model with $7.9\times$ smaller size while exhibiting even better generation quality than the original one. Our approach includes several novel techniques, such as assigning optimal bits to each layer, initializing the quantized model for better performance, and improving the training strategy to dramatically reduce quantization error. Furthermore, we extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality.

²²footnotetext: Work done during an internship at Snap Inc.¹¹footnotetext: Corresponding author.

Refer to caption — Figure 1: *Top*: Images generated from full-precision Stable Diffusion v1.5. *Bottom*: Images generated from BitsFusion, where the weights of UNet are quantized into *1.99* bits, achieving $7.9\times$ *smaller* storage than the one from Stable Diffusion v1.5. All the images are synthesized under the setting of using PNDM sampler [45] with $50$ sampling steps and random seed as $1024$ . Prompts and more generations are provided in App. N.

1 Introduction

Recent efforts in developing diffusion-based image generation models [69, 31, 71, 21, 72] have demonstrated remarkable results in synthesizing high-fidelity and photo-realistic images, leading to various applications such as content creation and editing [60, 59, 53, 63, 81, 79, 46, 39], video generation [20, 67, 3, 1, 16, 52, 15], and 3D asset synthesis [78, 42, 56, 66, 57], among others. However, Diffusion Models (DMs) come with the drawback of a large number of parameters, e.g., millions or even billions, causing significant burdens for transferring and storing due to the bulky model size, especially on resource-constrained hardware such as mobile and wearable devices.

Existing studies have explored reducing the model size of large-scale text-to-image diffusion models by designing efficient architectures and network pruning [40, 82, 32]. These approaches usually require significant amounts of training due to the changes made to the pre-trained networks. Another promising direction for model storage reduction is quantization [12, 30], where floating-point weights are converted to low-bit fixed-point representations, thereby saving computation memory and storage.

There have been emerging efforts on compressing the DMs through quantization [65, 37, 17, 38]. However, these approaches still face several major challenges, especially when quantizing large-scale text-to-image diffusion models like Stable Diffusion v1.5 (SD-v1.5) [62]. First, many of these methods are developed on relatively small-scale DMs trained on constrained datasets. For example, models trained on CIFAR-10 require modest storage of around $100$ MB [21, 38]. In contrast, SD-v1.5 necessitates $3.44$ GB of storage in a full-precision format. Adapting these methods to SD-v1.5 remains to be a challenging problem. Second, current arts mainly focus on quantizing weights to $4$ bits. How to quantize the model to extremely low bit is not well studied. Third, there is a lack of fair and extensive evaluation of how quantization methods perform on large-scale DMs, i.e., SD-v1.5.

To tackle the above challenges, this work proposes BitsFusion, a quantization-aware training framework that employs a series of novel techniques to compress the weights of pre-trained large-scale DMs into extremely low bits (i.e., 1.99 bits), achieving even better performance (i.e., higher image quality and better text-image alignment). Consequently, we compress the $1.72$ GB UNet (FP16)¹¹1For SD-v1.5, we measure the generation quality using the FP32 format. However, since SD-v1.5 FP16 has similar performance to SD-v1.5 FP32, we use SD-v1.5 FP16 to calculate our compression ratio. of SD-v1.5 into a $219$ MB model, achieving a $7.9\times$ compression ratio. Specifically, our contributions can be summarized into the following four dimensions:

•

Mixed-Precision Quantization for DMs. We propose an effective approach for quantizing DMs in a mixed-precision manner. First, we thoroughly analyze the appropriate metrics to understand the quantization error in the quantized DMs (Sec. 3.2). Second, based on the analysis, we quantize different layers into different bits according to their quantization error (Sec. 3.3).
•

Initialization for Quantized DMs. We introduce several techniques to initialize the quantized model to improve performance, including time embedding pre-computing and caching, adding balance integer, and alternating optimization for scaling factor initialization (Sec. 4.1).
•

Improved Training Pipeline for Quantized DMs. We improve the training pipeline for the quantized model with the proposed two-stage training approach (Sec. 4.2). In the first stage, we use the full-precision model as a teacher to train the quantized model through distillation. Our distillation loss forces the quantized model to learn both the predicted noise and the intermediate features from the teacher network. Furthermore, we adjust the distribution of time step sampling during training, such that the time steps causing larger quantization errors are sampled more frequently. In the second stage, we fine-tune the model using vanilla noise prediction [21].
•

Extensive Quantitative Evaluation. For the first time in the literature, we conduct extensive quantitative analysis to compare the performance of the quantized model against the original SD-v1.5. We include results on various benchmark datasets, i.e., TIFA [25], GenEval [13], CLIP score [58] and FID [19] on MS-COCO 2014 validation set [43]. Additionally, we perform human evaluation on PartiPrompts [77]. Our $1.99$ -bit weights quantized model consistently outperforms the full-precision model across all the evaluations, demonstrating the effectiveness of our approach.

2 Related Works

To enhance model efficiency in terms of storage and computational costs, quantization is adopted for diffusion models [65, 37, 18, 68, 73, 75, 47, 76, 4, 74, 7, 83, 27, 17, 38] with primarily two types: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ does not require a full training loop; instead, it utilizes a limited calibration dataset to adjust the quantization parameters. For example, PTQ4DM [65] calibrates the quantization parameters to minimize the quantization error of DMs. Q-Diffusion [37] minimizes the quantization error via the block-wise reconstruction [41]. PTQD [18] integrates quantization noise into the stochastic noise inherent in the sampling steps of DMs. TDQ [68] optimizes scaling factors for activations across different time steps, applicable to both PTQ and QAT strategies. TFMQ [27] focuses on reconstructing time embedding and projection layers to prevent over-fitting. However, PTQ often results in performance degradation compared to QAT, particularly when aiming for extremely low-bit DMs. In contrast, QAT involves training the full weights to minimize the quantization error, thereby achieving higher performance compared to PTQ. For instance, EfficientDM [17], inspired by LoRA [24], introduces a quantization-aware low-rank adapter to update the LoRA weights, avoiding training entire weights. Q-DM [38] employs normalization and smoothing operation on attention features through proposed Q-attention blocks, enhancing quantization performance. Nevertheless, existing works primarily study $4$ bits and above quantization on small-scale DMs trained on constrained datasets. In this paper, we focus on quantizing large-scale Stable Diffusion to extremely low bits and extensively evaluating the performance across different benchmark datasets.

3 Mixed Precision Quantization for Diffusion Models

In this section, we first go through the formulations of weight quantization and generative diffusion models. We then determine the mixed-precision strategy, assigning optimized bit widths to different layers to reduce the overall quantization error. Specifically, we first analyze the quantization error of each layer in the diffusion model and conclude sensitivity properties, Then, based on the analysis, we assign appropriate bits to each layer by jointly considering parameter efficiency (i.e., size savings).

3.1 Preliminaries

Quantization is a popular and commonly used technique to reduce model size. While many quantization forms exist, we focus on uniform quantization, where full-precision values are mapped into discrete integer values as follows:

\displaystyle\bm{\theta}_{\texttt{int}}

\displaystyle=\texttt{Clip}(\lfloor\frac{{\bm{\theta}_{\texttt{fp}}}}{\mathbf{% s}}\rceil+I_{z},0,2^{b}-1),

(1)

where $\bm{\theta}_{\texttt{fp}}$ denotes the floating-point weights, $\bm{\theta}_{\texttt{int}}$ is the quantized integer weights, $\mathbf{s}$ is the scaling factor, $I_{z}$ is the zero point, and $b$ is the quantization bit-width. $\lfloor\cdot\rceil$ denotes the nearest rounding operation and $\texttt{Clip}(\cdot)$ denotes the clipping operation that constrains $\bm{\theta}_{\texttt{int}}$ within the target range. Following the common settings [37, 17], we apply the channel-wise quantization and set $8$ bits for the first and last convolutional layer of the UNet.

Stable Diffusion. Denoising diffusion probabilistic models [69, 21] learn to predict real data distribution $\mathbf{x}\sim p_{\text{data}}$ by reversing the ODE flow. Specifically, given a noisy data sample $\mathbf{z}_{t}=\alpha_{t}\mathbf{x}+\sigma_{t}{\bm{\epsilon}}$ ( $\alpha_{t}$ and $\sigma_{t}$ are SNR schedules and $\boldsymbol{\epsilon}$ is the added ground-truth noise), and a quantized denoising model $\bm{\hat{\epsilon}_{\theta_{\texttt{int},s}}}$ parameterized by $\bm{\theta}_{\texttt{int}}$ and $\mathbf{s}$ , the learning objective can be formulated as follows,

\displaystyle\mathcal{L}_{\boldsymbol{\theta_{\texttt{int}}},\mathbf{s}}

\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\|\boldsymbol{\epsilon}-% \boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta_{\texttt{int}}},\mathbf{s}}(t,% \mathbf{z}_{t},\mathbf{c})\|\right],

(2)

where $t$ is the sampled time step and $\mathbf{c}$ is the input condition (e.g., text embedding). Note that during the training of quantized model, we optimize $\bm{\theta}_{\texttt{fp}}$ and $\mathbf{s}$ by backpropagating $\mathcal{L}_{\boldsymbol{\theta_{\texttt{int}}},\mathbf{s}}$ via Straight-Through Estimator (STE) [2] and quantize the weights to the integers for deployment. Here, for the notation simplicity, we directly use $\bm{\theta}_{\texttt{int}}$ to represent the optimized weights in the quantized models.

The latent diffusion model [62] such as Stable Diffusion conducts the denoising process in the latent space encoded by variational autoencoder (VAE) [34, 61], where the diffusion model is the UNet [9]. This work mainly studies the quantization for the UNet model, given it is the major bottleneck for the storage and runtime of the Stable Diffusion [40]. During the inference time, classifier-free guidance (CFG) [22] is usually applied to improve the generation,

\small\tilde{\bm{\epsilon}}_{\bm{\theta_{\texttt{int}}},\mathbf{s}}(t,\mathbf{% z}_{t},\mathbf{c})=w\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}},\mathbf{s}% }(t,\mathbf{z}_{t},\mathbf{c})-(w-1)\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{% int}},\mathbf{s}}(t,\mathbf{z}_{t},\varnothing),

(3)

where $w\geq 1$ and $\hat{\bm{\epsilon}}_{\bm{\theta}_{\texttt{int}},\mathbf{s}}(t,\mathbf{z}_{t},\varnothing)$ denotes the generation conditioned on the null text prompt $\varnothing$ .

3.2 Per-Layer Quantization Error Analysis

Obtaining Quantized Models. We first perform a per-layer sensitivity analysis for the diffusion model. Specifically, given a pre-trained full-precision diffusion model, we quantize each layer to $1$ , $2$ , and $3$ bits while freezing others at full-precision, and performing quantization-aware training (QAT) respectively. For instance, for the SD-v1.5 UNet with $256$ layers (excluding time embedding, the first and last layers), we get a total of $768$ quantized candidates. We perform QAT over each candidate on a pre-defined training sub dataset, and validate the incurred quantization error of each candidate by comparing it against the full-precision model (more details in App. B).

Measuring Quantization Errors. To find the appropriate way to interpret the quantization error, we analyze four metrics: Mean-Squared-Error (MSE) that quantifies the pixel-level discrepancies between images (generations from floating and the quantized model in our case), LPIPS [80] that assesses human-like perceptual similarity judgments, PSNR [23] that measures image quality by comparing the maximum possible power of a signal with the power of a corrupted noise, and CLIP score [58] that evaluates the correlation between an image and its language description. After collecting the scores (examples in Fig. 2(b) and Fig. 2(c), full metrics are listed in App. F), we further measure the consistency of them by calculating the Pearson correlation [8] for different metrics under the same bit widths (in Tab. 2), and different bit widths under the same metric (in Tab. 2). With these empirical results, we draw the following two main observations.

Observation 1: MSE, PSNR, and LPIPS show strong correlation and they correlate well with the visual perception of image quality.

Tab. 2 shows that MSE is highly correlated with PSNR and LPIPS under the same bit width. Additionally, we observe a similar trend of per-layer quantization error under different bit widths, as in Tab. 2. As for visual qualities in Fig. 2(a) and 2(b), we can see that higher MSE errors lead to severe image quality degradation, e.g., the highlighted RB conv shortcut. Therefore, the MSE metric effectively reflects quality degradations incurred by quantization, and it is unnecessary to incorporate PSNR and LPIPS further.

Observation 2: After low-bit quantization, changes in CLIP score are not consistently correlated with MSE across different layers. Although some layers show smaller MSE, they may experience larger semantic degradation, reflected in larger CLIP score changes.

We notice that, after quantization, the CLIP score changes for all layers only have a weak correlation with MSE, illustrated in Tab. 2. Some layers display smaller MSE but larger changes in CLIP score. For example, in Fig. 2(b), the MSE of CA tok layer (5_th highlighted layer (green) from left to right) is less than that of RB conv layer (6_th highlighted layer (orange) from left to right), yet the changes in CLIP score are the opposite. As observed in the first row of Fig. 2(a), compared to RB conv layer, quantizing this CA tok layer changes the image content from "a teddy bear" to "a person", which diverges from the text prompt A teddy bear on a skateboard in Times Square, doing tricks on a cardboard box ramp. This occurs because MSE measures only the difference between two images, which does not capture the semantic degradation. In contrast, the CLIP score reflects the quantization error in terms of semantic information between the text and image. Thus, we employ the CLIP score as a complementary metric to represent the quantization error.

Table 1: Pearson correlation (absolute value) of quantization error between different metrics (e.g., MSE vs. PSNR denotes the correlation between two metrics) when quantizing individual layers to 1, 2, and 3 bits. CS denotes CLIP Score.

	MSE vs. PSNR	MSE vs. LPIPS	MSE vs. CS
1 bit	0.870	0.984	0.733
2 bit	0.882	0.989	0.473
3 bit	0.869	0.991	0.535

Table 2: Pearson correlation (absolute value) of quantization error between different bit pairs (e.g., 1 vs. 2 denotes the correlation between the two bit widths) for a single metric when quantizing individual layers to 1, 2, and 3 bits.

	MSE	PSNR	LPIPS	CLIP Score
1 vs. 2 bit	0.929	0.954	0.943	0.504
1 vs. 3 bit	0.766	0.843	0.802	0.344
2 vs. 3 bit	0.887	0.923	0.895	0.428

3.3 Deciding the Optimal Precision

With the above observations, we then develop the strategy for bit-width assignments. We select MSE and CLIP as our quantitative metrics, along with the number of parameters of each layer as the indicator of size savings.

Assigning bits based on MSE. Intuitively, layers with more parameters and lower quantization error are better candidates for extremely low-bit quantization, as the overall bit widths of the model can be significantly reduced. According to this, we propose a layer size-aware sensitivity score $\mathcal{S}$ . For the $i_{th}$ layer, its sensitivity score for the $b$ -bits ( $b\in\{1,2,3\}$ ) is defined as $\mathcal{S}_{i,b}={M}_{i,b}{N}^{-\eta}_{i}$ , where $M$ denotes the MSE error, $N$ is the total number of parameters of the layer, and $\eta\in[0,1]$ denotes the parameter size factor. To determine the bit width (i.e., $b^{*}$ ) for each layer, we define a sensitivity threshold as $\mathcal{S}_{o}$ , and the $i_{th}$ layer is assigned to $b_{i}^{*}$ -bits, where $b_{i}^{*}=\min\{b|\mathcal{S}_{i,b}<\mathcal{S}_{o}\}$ . The remaining layers are $4$ bits.

Assigning bits based on CLIP score. For the layers with a high CLIP score dropping after quantization, instead of assigning bits based on sensitivity score as discussed above, we directly assign higher bits to those layers. Therefore, the quantized model can produce content that aligns with the semantic information of the prompt. We provide the detailed mixed-precision algorithm in Alg. 1 of App. B.

4 Training Extreme Low-bit Diffusion Model

With the bits of each layer decided, we then train the quantized model with a series of techniques to improve performance. The overview of our approach is illustrated in Fig. 3.

4.1 Initializing the Low-bit Diffusion Model

Time Embedding Pre-computing and Caching. During the inference time of a diffusion model, a time step $t$ is transformed into an embedding through projection layers to be incorporated into the diffusion model. As mentioned by existing works [27], the quantization of the projection layers can lead to large quantization errors. However, the embedding from each time step $t$ is always the same, suggesting that we can actually pre-compute the embedding offline and load cached values during inference, instead of computing the embedding every time. Furthermore, the storage size of the time embedding is $25.6\times$ smaller than the projection layers. Therefore, we pre-compute the time embedding and save the model without the project layers. More details are provided in App. C.

Adding Balance Integer. In general, weight distributions in deep neural networks are observed as symmetric around zero [84]. To validate the assumption on SD-v1.5, we analyze its weight distribution for the layers under full precision by calculating the skewness of weights. Notably, the skewness of more than $97\%$ of the layers ranges between $\left[-0.5,0.5\right]$ , indicating that the weight distributions are symmetric in almost all layers. Further details are provided in App. D.

However, existing works on diffusion model quantization overlook the symmetric property [37, 65, 38], as they perform relatively higher bits quantization, e.g., $4$ or $8$ bits. This will hurt the model performance at extremely low bit levels. For example, in $1$ -bit quantization, the possible most symmetric integer outcomes can only be $\{0,1\}$ or $\{-1,0\}$ . Similarly, for $2$ -bit quantization, the most balanced mapping integers can be either $\{-2,-1,0,1\}$ or $\{-1,0,1,2\}$ , significantly disrupting the symmetric property. The absence of a single value among $2$ or $4$ numbers under low-bit quantization can have a significant impact. To tackle this, we leverage the bit balance strategy [36, 51] to initialize the model. Specifically, we introduce an additional value to balance the original quantization values. Namely, in a 1-bit model, we adjust the candidate integer set from $\{0,1\}$ to $\{-1,0,1\}$ , achieving a more balanced distribution. By doing so, we treat the balanced $n$ -bits weights as $\texttt{log}(2^{n}+1)$ -bits.

Scaling Factor Initialization via Alternating Optimization. Initializing scaling factors is an important step in quantization. Existing QAT works typically employ the Min-Max initialization strategy [17, 48] to ensure the outliers are adequately represented and preserved. However, such a method faces challenges in extremely low-bit quantization settings like $1$ -bit, since the distribution of the full-precision weights is overlooked, leading to a large quantization error and the increased difficulty to converge. Therefore, we aim to minimize the $\ell_{2}$ error between the quantized weights and full-precision weights with the optimization objective as:

\small\min_{\mathbf{s}}\|\mathbf{s}\cdot(\bm{\theta}_{\texttt{int}}-I_{z})-\bm% {\theta}_{\texttt{fp}}\|^{2}.

(4)

Nevertheless, considering the rounding operation, calculating an exact closed-form solution is not straightforward [29]. Inspired by the Lloyd-Max algorithm [28, 49], we use an optimization method on scaling factor $\mathbf{s}$ to minimize the initialization error of our quantized diffusion model as follows:

\displaystyle\bm{\theta}_{\texttt{int}}^{j}=Q_{\texttt{int}}(\bm{\theta}_{% \texttt{fp}},\mathbf{s}^{j-1});\,\,\,\mathbf{s}^{j}=\frac{\bm{\theta}_{\texttt% {fp}}^{j}(\bm{\theta}_{\texttt{int}}^{j}-I_{z})^{\intercal}}{(\bm{\theta}_{% \texttt{int}}^{j}-I_{z})(\bm{\theta}_{\texttt{int}}^{j}-I_{z})^{\intercal}},

(5)

where $Q_{\texttt{int}}(\cdot)$ denotes the integer mapping quantization operation that converts the full-precision weights to integer as Eq. (1), and $j$ represents the iterative step. The optimization is done for $10$ steps.

4.2 Two-Stage Training Pipeline

With the mixed-precision model initialized, we introduce the two-stage training pipeline. In Stage-I, we train the quantized model using the full-precision model as the teacher through distillation loss. In Stage-II, we fine-tune the model from the previous stage using noise prediction [21, 72].

CFG-aware Quantization Distillation. Similar to existing works [11], we fine-tune the quantized diffusion model to improve the performance. Here both the weights and scaling factors are optimized. Additionally, we notice that training the quantized model in a distillation fashion using the full-precision model yields better performance than training directly with vanilla noise prediction. Furthermore, during distillation, it is crucial for the quantized model to be aware of CFG, i.e., text dropping is applied during distillation. Specifically, our training objective is as follows:

\displaystyle\mathcal{L}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}^{% \texttt{noise}}=\mathbb{E}_{t,\mathbf{x}}\left[\|\boldsymbol{\hat{\epsilon}}_{% \boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\boldsymbol{% \hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}(t,\mathbf{z}_{% t},\mathbf{c})\|\right],\mathbf{c}=\varnothing\ \textrm{if}\ P\sim U[0,1]<p\ % \textrm{else}\ \mathbf{c},

(6)

where $P$ controls the text dropping probability during training and $p$ is set as $0.1$ .

Feature Distillation. To further improve the generation quality of the quantized model, we distill the full-precision model at a more fine-grained level through feature distillation [32] as follows:

\displaystyle\mathcal{L}_{\boldsymbol{\theta}_{\texttt{int}},\mathbf{s}}^{% \texttt{feat}}=\mathbb{E}_{t,\mathbf{x}}\left[\|\mathcal{F}_{\boldsymbol{% \theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\mathcal{F}_{\boldsymbol{% \theta}_{\texttt{int}},\mathbf{s}}(t,\mathbf{z}_{t},\mathbf{c})\|\right],

(7)

where $\mathcal{F}_{\boldsymbol{\theta}}(\cdot)$ denotes the operation for getting features from the Down and Up blocks in UNet. We then have the overall distillation loss $\mathcal{L}^{\texttt{dist}}$ in Stage-I as follows:

\displaystyle\mathcal{L}^{\texttt{dist}}=\mathcal{L}_{\boldsymbol{\theta}_{% \texttt{int}},\mathbf{s}}^{\texttt{noise}}+\lambda\mathcal{L}_{\boldsymbol{% \theta}_{\texttt{int}},\mathbf{s}}^{\texttt{feat}},

(8)

where $\lambda$ is empirically set as $0.01$ to balance the magnitude of the two loss functions.

Quantization Error-aware Time Step Sampling. The training of diffusion models requires sampling different time steps in each optimization iteration. We explore how to adjust the strategy for time step sampling such that the quantization error in each time step can be effectively reduced during training. We first train a $1.99$ -bit quantized model with Eq. (8). Then, we calculate the difference of the predicted latent features between the quantized model and the full-precision model as $\mathbb{E}_{t,\mathbf{x}}[\frac{1-\bar{\alpha}_{t}}{\bar{\alpha}_{t}}\|% \boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,\mathbf{z}_{% t},\mathbf{c})-\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int}}% ,\mathbf{s}}(t,\mathbf{z}_{t},\mathbf{c})\|^{2}]$ , where $t\in\left[0,1,\cdots,999\right]$ and $\bar{\alpha}_{t}$ is the noise scheduler (detailed derivation in App. E). The evaluation is conducted on a dataset with $128$ image-text pairs. Fig. 4 shows the quantization error does not distribute equally across all time steps. Notably, the quantization error keeps increasing as the time steps approach $t=999$ .

To mitigate the quantization error prevalent near the time steps $t=999$ , we propose a sampling strategy by utilizing a distribution specifically tailored to sample more time steps exhibiting the largest quantization errors, thereby enhancing performance. To achieve this goal, we leverage the Beta distribution. Specifically, time steps are sampled according to $t\sim\textit{Beta}(\alpha,\beta)$ , as shown in Fig. 4. We empirically set $\alpha=3.0$ and $\beta=1.0$ for the best performance. Combining the strategy of time steps sampling with Eq. (8), we conduct the Stage-I training.

Fine-tuning with Noise Prediction. After getting the model trained with the distillation loss in Stage-I, we then fine-tune it with noise prediction, as in Eq. (2), in Stage-II. We apply a text dropping with probability as $10\%$ and modify the distribution of time step sampling based on the quantization error, as introduced above. The reason we leverage two-stage fine-tuning, instead of combining Stage-I and Stage-II, is that we observe more stabilized training results.

5 Experiments

Implementation Details. We develop our code using diffusers library²²2https://github.com/huggingface/diffusers and train the models with AdamW optimizer [33] and a constant learning rate as $1$ e $-05$ on an internal dataset. For Stage-I, we use $8$ NVIDIA A100 GPUs with a total batch size of $256$ to train the quantized model for $20$ K iterations. For Stage-II, we use $32$ NVIDIA A100 GPUs with a total batch size of $1024$ to train the quantized model for $50$ K iterations. During inference, we adopt the PNDM scheduler [45] with $50$ sampling steps to generate images for comparison. Other sampling approaches (e.g., DDIM [70] and DPMSolver [50]) lead to the same conclusion (App. K).

Evaluation Metrics. We conduct evaluation on CLIP Score and FID on MS-COCO [44], TIFA [26], GenEval [14], and human evaluation on PartiPrompts [77]. We adopt ViT-B/32 model [10] in CLIP score and the Mask2Former(Swin-S-8 $\times$ 2) [5] in GenEval. App. I provides details for the metrics.

5.1 Main Results

Comparison with SD-v1.5. Our quantized $1.99$ -bits UNet consistently outperforms the full-precision model across all metrics.

•

$30$ K MS-COCO 2014 Validation Set. For the CLIP score, as demonstrated in Fig. 5(a), attributed to the proposed mixed-precision recipe with the introduced initialization techniques and advanced training schemes in Stage-I, our $1.99$ -bits UNet, with a storage size of $219$ MB, achieves performance comparable to the original SD-v1.5. Following Stage-II training, our model surpasses the performance of the original SD-v1.5. With CFG scales ranging from $2.5$ to $9.5$ , our model yields $0.002\sim 0.003$ higher CLIP scores. We provide a discussion of FID in Appendix L.
•

TIFA. As shown in Fig. 5(b), our $1.99$ -bits model with Stage-I training performs comparably to the SD-v1.5. With the Stage-II training, our model achieves better metrics over the SD-v1.5.
•

GenEval. We show the comparison results for GenEval in Fig. 5(c) (detailed comparisons of GenEval score are presented in Appn. M). Our model outperforms SD-v1.5 for all CFG scales.
•

Human Evaluation. More users prefer the images generated by our quantized model over SD-v1.5, with the ratio as $54.4\%$ . We provide a detailed comparison in App. J.

Comparison with Other Quantization Approaches. Additionally, we conduct the experiments by comparing our approach with other works including LSQ [11], Q-Diffusion [37], EfficientDM [17], and Apple-MBP [54], as shown in Tab. 4. Our model achieves a higher CLIP score compared with all other works and better performance than SD-v1.5.

Table 3: Comparison with existing quantization methods, including LSQ [11], Q-Diffusion [37], EfficientDM [17], and Apple-MBP [54]. The CLIP score is measured on 1K PartiPrompts.

Method	Bit-width	CLIP score
SD-v1.5	32	0.3175
LSQ	2	0.2849
Q-Diffusion	4	0.3137
EfficientDM	2	0.2918
Apple-MBP	2	0.3023
Ours	1.99	0.3212

Table 4: Analysis of our proposed methods measured under various CFG scales, i.e.,

3.5

5.5

7.5

, and

9.5

. We use LSQ [11] as the basic QAT method, which involves the training of weights and scaling factors of a uniformly

2

-bit quantized UNet. Then, we gradually introduce each proposed technique to evaluate their effectiveness. CLIP scores are measured on 1K PartiPrompts.

Method	Bit-width	3.5	5.5	7.5	9.5	Average	$\Delta$
SD-v1.5	32	0.3110	0.3159	0.3175	0.3180	0.3156	-
QAT-Base	2	0.2679	0.2793	0.2849	0.2868	0.2797	-
+Balance	2.32	0.2990	0.3059	0.3080	0.3086	0.3054	+0.0257
+Alternating Opt.	2.32	0.3061	0.3108	0.3117	0.3115	0.3100	+0.0046
+Mixed/Caching	1.99	0.3055	0.3129	0.3142	0.3145	0.3118	+0.0018
+Feat Dist.	1.99	0.3086	0.3147	0.3167	0.3169	0.3142	+0.0024
+Time Sampling	1.99	0.3098	0.3159	0.3181	0.3184	0.3156	+0.0014
+Fine-tuning	1.99	0.3163	0.3192	0.3212	0.3205	0.3183	+0.0027

5.2 Ablation Analysis

Here we perform extensive analysis for our proposed method. We mainly evaluate different experimental settings using the CLIP score measured on 1K PartiPrompts [77].

Analysis of the Proposed Techniques. We adopt the LSQ [11] as the basic QAT method to update the weights and scaling factors of a uniform 2-bit UNet with Min-Max initialization. Results are presented in Tab. 4 with the following details:

•

+Balance. By adding a balance integer, a $2$ -bit model that typically represents $4$ integer values can now represent $5$ integers, becoming a $2.32$ -bit model by $\texttt{log}(4+1)$ bits. The average CLIP score has significantly increased from $0.2797$ to $0.3054$ .
•

+Alternating Opt. By further utilizing the scaling factor initialization via alternating optimization, the average CLIP score of the $2.32$ -bit model increases to $0.3100$ .
•

+Mixed/Caching. By leveraging time embedding pre-computing and caching, we minimize the storage requirements for time embedding and projection layers by only retaining the calculated features. This significantly reduces the averaged bits. Combined with our mixed-precision strategy, this approach reduces the average bits from $2.32$ to $1.99$ bits and can even improve the performance, i.e., CLIP score improved from $0.3100$ to $0.3118$ .
•

+Feat Dist. By incorporating the feature distillation loss, i.e., Eq. (7), the model can learn more fine-grained information from the teacher model, improving CLIP score from $0.3118$ to $0.3142$ .
•

+Time Sampling. By employing a quantization error-aware sampling strategy at various time steps, the model focuses more on the time step near $t=999$ . With this sampling strategy, our $1.99$ -bits model performs very closely to, or even outperforms, the original SD-v1.5.
•

+Fine-tuning. By continuing with Stage-II training that incorporates noise prediction, our $1.99$ -bits model consistently outperforms the SD-v1.5 across various guidance scales, improving the CLIP score to $0.3183$ .

Table 5: Analysis of

\eta

in the mixed-precision strategy.

$\eta$	0	0.1	0.2	0.3	0.4	0.5
CLIP score	0.3155	0.3173	0.3162	0.3181	0.3171	0.3168

Table 6: Anlysis of

\lambda

in distillation loss.

$\lambda$	1	0.1	0.01
CLIP score	0.3164	0.3159	0.3181

Table 7: Analysis of

\alpha

in time step-aware sampling.

$\alpha$	1.5	2.0	3.0
CLIP score	0.3169	0.3173	0.3181

Effect of $\eta$ in Mixed-Precision Strategy. Tab. 7 illustrates the impact of the parameter size factor $\eta$ (as discussed in Sec. 3.3) in determining the optimal mixed precision strategy. We generate six different mixed precision recipes with different $\eta$ with 20K training iterations for comparisons. Initially, we explore the mixed precision strategy determined with and without the parameter size factor. Setting $\eta=0$ results in $N^{-\eta}=1$ , indicating that the mixed precision is determined without considering the impact of parameter size. The results show that neglecting the parameter size significantly degrades performance. Further, we empirically choose $\eta=0.3$ in our experiments after comparing different values of $\eta$ .

Effect of $\lambda$ of Distillation Loss. Tab. 7 illustrates the impact of the balance factor $\lambda$ for loss functions in Eq. (8). We empirically choose $\lambda=0.01$ in our experiments after comparing the performance.

Effect of $\alpha$ in Time Step-aware Sampling Strategy. Tab. 7 illustrates the impact of the $\alpha$ for different Beta sampling distribution. As analyzed in Sec. 4.2, the quantization error increases near $t=999$ . To increase sampling probability near this time step, Beta distribution requires $\alpha>1$ with $\beta=1$ . A larger $\alpha$ enhances the sampling probability near $t=999$ . Compared to $\alpha=1.5$ and $\alpha=2.0$ , $\alpha=3.0$ concentrates more on later time steps and achieves the best performance. We choose $\alpha=3.0$ in our experiments.

Analysis for Different Schedulers. One advantage of our training-based quantization approach is that our quantized model consistently outperforms SD-v1.5 for various sampling approaches. We conduct extensive evaluations on TIFA to show we achieve better performance than SD-v1.5 for using both DDIM [70] and DPMSolver [50] to perform the sampling. More details are shown in App. K.

6 Conclusion

To enhance the storage efficiency of the large-scale diffusion models, we introduce an advanced weight quantization framework, BitsFusion, which effectively compresses the weights of UNet from SD-v1.5 to $1.99$ bits, achieving a $7.9\times$ smaller model size. BitsFusion even outperforms SD-v1.5 in terms of generation quality. Specifically, we first conduct a comprehensive analysis to understand the impact of each layer during quantization and establish a mixed-precision strategy. Second, we propose a series of effective techniques to initialize the quantized model. Third, during the training stage, we enforce the quantized model to learn the full-precision SD-v1.5 by using distillation losses with the adjusted distribution of time step sampling. Finally, we fine-tune the previous quantized model through vanilla noise prediction. Our extensive evaluations on TIFA, GenEval, CLIP score, and human evaluation consistently demonstrate the advantage of BitsFusion over full-precision SD-v1.5.

References

[1] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024.
[2] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
[3] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
[4] Hanwen Chang, Haihao Shen, Yiyang Cai, Xinyu Ye, Zhenzhong Xu, Wenhua Cheng, Kaokao Lv, Weiwei Zhang, Yintong Lu, and Heng Guo. Effective quantization for diffusion models on cpus. arXiv preprint arXiv:2311.16133, 2023.
[5] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
[6] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3043–3054, 2023.
[7] Huanpeng Chu, Wei Wu, Chengjie Zang, and Kun Yuan. Qncd: Quantization noise correction for diffusion models. arXiv preprint arXiv:2403.19140, 2024.
[8] Israel Cohen, Yiteng Huang, Jingdong Chen, Jacob Benesty, Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. Noise reduction in speech processing, pages 1–4, 2009.
[9] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
[11] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. In International Conference on Learning Representations, 2019.
[12] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
[13] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
[14] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36, 2024.
[15] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
[16] Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023.
[17] Yefei He, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Efficientdm: Efficient quantization-aware fine-tuning of low-bit diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
[18] Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Ptqd: Accurate post-training quantization for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
[19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[20] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[22] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[23] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010.
[24] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
[25] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. arXiv preprint arXiv:2303.11897, 2023.
[26] Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023.
[27] Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, and Xianglong Liu. Tfmq-dm: Temporal feature maintenance quantization for diffusion models. arXiv preprint arXiv:2311.16503, 2023.
[28] Kyuyeon Hwang and Wonyong Sung. Fixed-point feedforward deep neural network design using weights+ 1, 0, and- 1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pages 1–6. IEEE, 2014.
[29] Yerlan Idelbayev, Pavlo Molchanov, Maying Shen, Hongxu Yin, Miguel A Carreira-Perpinán, and Jose M Alvarez. Optimal quantization using scaled codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12095–12104, 2021.
[30] Qing Jin, Jian Ren, Richard Zhuang, Sumant Hanumante, Zhengang Li, Zhiyu Chen, Yanzhi Wang, Kaiyuan Yang, and Sergey Tulyakov. F8net: Fixed-point 8-bit only multiplication for network quantization. arXiv preprint arXiv:2202.05239, 2022.
[31] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
[32] Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On architectural compression of text-to-image diffusion models. arXiv preprint arXiv:2305.15798, 2023.
[33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[34] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[35] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
[36] Fengfu Li, Bin Liu, Xiaoxing Wang, Bo Zhang, and Junchi Yan. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[37] Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17535–17545, 2023.
[38] Yanjing Li, Sheng Xu, Xianbin Cao, Xiao Sun, and Baochang Zhang. Q-dm: An efficient low-bit quantized diffusion model. Advances in Neural Information Processing Systems, 36, 2024.
[39] Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Textcraftor: Your text encoder can be image quality controller. arXiv preprint arXiv:2403.18978, 2024.
[40] Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snapfusion: Text-to-image diffusion model on mobile devices within two seconds. Advances in Neural Information Processing Systems, 36, 2024.
[41] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations, 2020.
[42] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv preprint arXiv:2211.10440, 2022.
[43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[44] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[45] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
[46] Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, and Sergey Tulyakov. Hyperhuman: Hyper-realistic human generation with latent structural diffusion. arXiv preprint arXiv:2310.08579, 2023.
[47] Xuewen Liu, Zhikai Li, Junrui Xiao, and Qingyi Gu. Enhanced distribution alignment for post-training quantization of diffusion models. arXiv preprint arXiv:2401.04585, 2024.
[48] Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888, 2023.
[49] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982.
[50] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
[51] Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
[52] Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. arXiv preprint arXiv:2402.14797, 2024.
[53] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[54] Atila Orhon, Michael Siracusa, and Aseem Wadhwa. Stable diffusion with core ml on apple silicon, 2022.
[55] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2023.
[56] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
[57] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843, 2023.
[58] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[59] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[60] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[61] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR, 2014.
[62] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[63] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
[64] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
[65] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981, 2023.
[66] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. In The Twelfth International Conference on Learning Representations, 2023.
[67] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
[68] Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, and Eunhyeok Park. Temporal dynamic quantization for diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
[69] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[70] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[71] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
[72] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[73] Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Zewen Wu, Yansong Tang, and Wenwu Zhu. Post-training quantization with progressive calibration and activation relaxing for text-to-image diffusion models. arXiv preprint arXiv:2311.06322, 2023.
[74] Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu. Towards accurate data-free quantization for diffusion models. arXiv preprint arXiv:2305.18723, 2023.
[75] Haoxuan Wang, Yuzhang Shang, Zhihang Yuan, Junyi Wu, and Yan Yan. Quest: Low-bit diffusion model quantization via efficient selective finetuning. arXiv preprint arXiv:2402.03666, 2024.
[76] Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, and Hongbo Zhang. Efficient quantization strategies for latent diffusion models. arXiv preprint arXiv:2312.05431, 2023.
[77] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022.
[78] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978, 2022.
[79] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[80] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[81] Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris N Metaxas, and Jian Ren. Sine: Single image editing with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6027–6037, 2023.
[82] Yang Zhao, Yanwu Xu, Zhisheng Xiao, and Tingbo Hou. Mobilediffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567, 2023.
[83] Xingyu Zheng, Haotong Qin, Xudong Ma, Mingyuan Zhang, Haojie Hao, Jiakai Wang, Zixiang Zhao, Jinyang Guo, and Xianglong Liu. Binarydm: Towards accurate binarization of diffusion model. arXiv preprint arXiv:2404.05662, 2024.
[84] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. In International Conference on Learning Representations, 2016.

\doparttoc\faketableofcontents

Appendix

\parttoc

Appendix A Limitations

In this work, we study the storage size reduction of the UNet in Stable Diffusion v1.5 through weight quantization. The compression of VAE and CLIP text encoder [58] is also an interesting direction, which is not explored in this work. Additionally, our weight quantization techniques could be extended to the activations quantization, as a future exploration.

Appendix B More details for Mixed-Precision Algorithm

In Sec. 3, we analyze the per-layer quantization error and develop the mixed-precision strategy. Here, we provide the detailed algorithm as outlined in Alg. 1. The inputs include: a pre-defined candidate set of bit-width $b\in\{1,2,3\}$ , the full-precision SD-v1.5 $D$ , the total number of layers $L$ (except for the time embedding, time projection, the first and last convolutional layers), the training dataset $X$ , the number of training iterations $T$ , the number of evaluation images for calculating metrics $K$ , the bit threshold $\mathcal{S}_{o}$ , the parameter size factor $\eta$ , and the number of parameters of the $i_{th}$ layer $N_{i}$ .

In the first stage, we aim to obtain quantized models by quantizing each individual layer. Given the full-precision SD-v1.5 UNet $D$ , we consecutively perform the quantization on every single layer to $1$ , $2$ , or $3$ bits individually, while maintaining the remaining layers at FP32 format. Notice, to align with our experiments, we add the balance integer and initialize the scaling factor with our alternating optimization. For each quantized model, the weights and scaling factors are fine-tuned using quantization-aware training to minimize the quantization error by learning the predicted noise of the SD-v1.5. We obtain quantized models $D_{i,b},i=1,2,\cdots,L,b=1,2,3$ .

In the second stage, we measure the quantization error of each layer by calculating various metrics from comparing images generated by the quantized model $D_{i,b}$ with those from the unquantized SD-v1.5 $D$ . Specifically, we generate $K=100$ baseline images $I_{d}$ from the full-precision SD-v1.5 model with PartiPrompts. Then, for each quantized model $D_{i,b}$ , we use identical prompts and seed to generate corresponding images $I_{i,b}$ . We calculate the quantization error by measuring the metrics including MSE, CLIP score, PSNR, and LPIPS using these images and prompts.

In the third stage, we collect the mixed-precision recipe. We first compute a sensitivity score for each layer, factoring in both the MSE and the parameter size adjusted by $\eta$ . For the $i_{th}$ layer, its sensitivity score for the $b$ -bits ( $b\in\{1,2,3\}$ ) is defined as $\mathcal{S}_{i,b}={M}_{i,b}{N}^{-\eta}_{i}$ , where $M$ denotes the MSE error, $N$ is the total number of parameters of the layer, and $\eta\in[0,1]$ denotes the parameter size factor. To determine the bit width (i.e., $b^{*}$ ) for each layer, we define a sensitivity threshold as $\mathcal{S}_{o}$ , and the $i_{th}$ layer is assigned to $b_{i}^{*}$ -bits, where $b_{i}^{*}=\min\{b|\mathcal{S}_{i,b}<\mathcal{S}_{o}\}$ . The remaining layers are set as 4 bits if they fail to meet the threshold. After determining the initial bits based on the MSE error, we refine this recipe by considering the degradation in the CLIP score associated with each bit-width. We simply consider the CLIP score change at 3 bits. We assign layers with the highest $10\%,5\%,2\%$ CLIP score drop with 1, 2, 3 more bits, respectively.

The final output is a mixed-precision recipe $\{b_{i}^{*}\},i=1,2,\cdots,L$ , specifying the bit-width for each layer. Then, we set the first and last convolutional layers as $8$ bits and pre-computing and caching the time embedding and projection layers.

Algorithm 1 Mixed-Precision Algorithm

Input: Candidate bits set $b\in\{1,2,3\}$ , SD-v1.5 model $D$ , number of total layers $L$ (except for the time embedding, time projection, the first and last convolutional layers), dataset $X$ , training iterations $T$ , number of evaluation images $K$ , threshold $S_{o}$ , parameter size factor $\eta$ , number of parameters of the $i_{th}$ layer $N_{i}$ .

Output: Mixed precision recipe $\{b_{i}^{*}\},i=1,2,\cdots,L$ .

1: 1: Obtaining the quantized models.

2: for

b=1

3

3: for

i=1

L

4: Quantize the

i

-th layer to

b

bits via Eq. (1) and proposed initialization methods in Sec. 4.1 to get model

D_{i,b}

;

5: for

t=1

T

6: Updating weights and scaling factors by minimizing the quantization error using quantization-aware training on

D_{i,b}

with data

X

;

7: end for

8: end for

9: end for

10: 2: Calculating quantization error metrics.

11: Generating

K

images

I_{d}

via SD-v1.5;

12: for

b=1

3

13: for

i=1

L

14: Generating

K

images

I_{i,b}

via quantized model

D_{i,b}

;

15: Calculating MSE,

M_{i,b}

via images

I_{i,b}

and

I_{d}

;

16: Calculating PSNR,

P_{i,b}

via images

I_{i,b}

and

I_{d}

;

17: Calculating LPIPS,

L_{i,b}

via images

I_{i,b}

and

I_{d}

;

18: Calculating CLIP score drop,

C_{i,b}

via images

I_{i,b}

and prompts;

19: end for

20: end for

21: 2: Deciding the optimal precision.

22: Calculating sensitivity score

\mathcal{S}_{i,b}={M}_{i,b}{N}^{-\eta}_{i}

;

23: for

i=1

L

24:

b_{i}^{*}\leftarrow 4

;

25: for

b=3

1

26: if

\mathcal{S}_{i,b}<\mathcal{S}_{o}

then

27: Assign the

i

-th layer with

b

bits with

b_{i}^{*}\leftarrow b

;

28: end if

29: end for

30: end for

31: Calculating CLIP score drop,

C_{i,3}

and its

p_{th}

percentile

C_{p}

;

32: for

i=1

L

33: if

{C}_{i,3}>C_{90}

then

34:

b_{i}^{*}\leftarrow b_{i}^{*}+1

;

35: end if

36: if

{C}_{i,3}>C_{95}

then

37:

b_{i}^{*}\leftarrow b_{i}^{*}+1

;

38: end if

39: if

{C}_{i,3}>C_{98}

then

40:

b_{i}^{*}\leftarrow b_{i}^{*}+1

;

41: end if

42: end for

Appendix C More Details for Time Embedding Pre-computing and Caching

In Sec. 4.1, we introduce "Time Embedding Pre-computing and Caching". Here, we provide more details for the algorithm. In the Stable Diffusion model, the time step $t\in\left[0,1,\cdots,999\right]$ is transformed into a time embedding $\texttt{emb}_{t}$ through the equation $\texttt{emb}_{t}=e(t)$ , where $e(t)$ denotes the time embedding layer and $\texttt{emb}_{t}\in\mathbb{R}^{d_{te}}$ . In SD-v1.5, $d_{te}=1280$ . Then, for each ResBlock, denoted as $R_{i}$ for $i=1,2,\cdots,N_{r}$ , where $N_{r}$ is total number of ResBlocks with time projection layers, the $\texttt{emb}_{t}$ is encoded by time projection layers $r_{i}(\cdot)$ by $\texttt{F}_{i,t}=r_{i}(\texttt{emb}_{t})$ . Notice that $r_{i}(\cdot)$ and $e(\cdot)$ are both linear layers. Finally, $\texttt{F}_{i,t}$ is applied to the intermediate activations of each $R_{i}$ via addition operation, effectively incorporating temporal information into the Stable Diffusion model.

As observed before [27], time embedding and projection layers exhibit considerable sensitivity to quantization during PTQ on DM. To address this problem, existing work specifically pays attention to reconstructing layers related to time embedding [27]. In this study, we propose a more effective method. We observe that 1) during the inference stage, for each time step $t$ , the $\texttt{emb}_{t}$ and consequently $\texttt{F}_{i,t}$ remain constant. 2) In the Stable Diffusion model, the shape of $\texttt{F}_{i,t}$ are considerably smaller compared to time embedding and projection layers. Specifically, in SD-v1.5, $\texttt{F}_{i,t}$ is with the dimension in $\{320,640,1280\}$ which is largely smaller than time projection layers $W_{r}\in\mathbb{R}^{D\times 1280}$ , where $D\in\{320,640,1280\}$ . Therefore, we introduce an efficient and lossless method named Time Embedding Pre-computing and Caching. Specifically, for total $T_{\texttt{inf}}$ inference time steps, we opt to store only $T_{\texttt{inf}}$ time features, rather than retaining the original time embedding layers $e(\cdot)$ and the time projection layers in the $i$ -th ResBlock $r_{i}(\cdot)$ .

The inference time steps are set as 50 or less in most Stable Diffusion models. This method significantly reduces more than $1280/50=25.6\times$ storage requirements and entire computational costs in terms of time-related layers. Given that the storage size of the pre-computed $\texttt{F}_{i,t}$ is substantially smaller than that of the original linear layers, this approach effectively diminishes the average bit of our quantized model without any performance degradation.

Appendix D Analysis of Symmetric Weight Distribution

In Sec. 4.1, we introduce "Adding Balance Integer" by assuming the weight distribution in Stable Diffusion is symmetric. Here, we provide more analysis for the assumption. To verify the weight distribution is symmetric around zero in SD-v1.5, we measure the skewness of the weight distribution of each layer. Lower skewness indicates a more symmetric weight distribution. As illustrated in Fig. 6, $97\%$ of layers exhibiting skewness between [-0.5, 0.5], this suggests that most layers in SD-v1.5 have symmetric weight distributions.

Appendix E More Details for Quantization Error Across Different Time Steps

In Sec. 4.2, we introduce the "Quantization Error-aware Time Step Sampling" method. Here, we provide more details for measuring the quantization error from the predicted latent instead of the predicted noise. During the inference stage, the actual denoising step requires the scaling operation on the predicted noise in diffusion models. Therefore, directly calculating the quantization error via noise prediction is not accurate. Instead, we calculate the quantization error in the latent feature space. We derive the relationship of quantization error calculated from the predicted latent and noise as follows:

$\displaystyle E$	$\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\\|\mathbf{\hat{z}}_{\boldsymbol{% \theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\mathbf{\hat{z}}_{% \boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_{t},\mathbf{c})\\|^% {2}\right],$	(9)
	$\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\left\\|\left(\frac{1}{\sqrt{\bar{% \alpha}_{t}}}\mathbf{z}_{t}-\frac{\sqrt{1-\bar{\alpha}_{t}}}{\sqrt{\bar{\alpha% }_{t}}}\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,% \mathbf{z}_{t},\mathbf{c})\right)-\left(\frac{1}{\sqrt{\bar{\alpha}_{t}}}% \mathbf{z}_{t}-\frac{\sqrt{1-\bar{\alpha}_{t}}}{\sqrt{\bar{\alpha}_{t}}}% \boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,% \mathbf{z}_{t},\mathbf{c})\right)\right\\|^{2}\right],$
	$\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\frac{1-\bar{\alpha}_{t}}{\bar{% \alpha}_{t}}\left\\|\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{% fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\boldsymbol{\hat{\epsilon}}_{\boldsymbol{% \theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_{t},\mathbf{c})\right\\|^{2}% \right],$

where $\bar{\alpha}_{t}$ is the noise scheduler in [21].

Appendix F Detailed Metrics for Quantization Error by Quantizing Different Layers

In Sec. 3.2, we calculate the various metrics for representing the quantization error when quantizing different layers. Here, we provide detailed metrics when quantizing each layer of SD-v1.5 to 1, 2, and 3 bits.

Appendix G More Visualization for Quantization Error by Quantizing Different Layers

In Sec. 3.2, we show the images for representing the quantization error when quantizing different layers. Here, we provide more visualization for demonstrating the different quantization errors caused by quantizing different layers to 1 bit. The quantized layers from left to right correspond to the annotated layers at the bottom: SD-v1.5 w/o quantization, Down.0.0.attn2.toq, Down.0.0.attn2.tok, Down.0.0.attn2.tov, Down.1.0.attn2.tok, Down.1.1.attn2.tok, Down.2.res.0.conv1, Up.2.res.2.convshortcut, Up.3.2.attn2.tok, Up.3.res.2.convshortcut.

Appendix H 1.99 Bits Mixed Precision Recipe

We provide our 1.99 bits recipe in our experiments. During the training and inference stage, we add a balancing integer to the $n$ -bits values, resulting in $\texttt{log}(2^{n}+1)$ bits. We calculate the average bits by $\frac{\sum_{i}\texttt{log}(2^{b_{i}^{*}}+1)\times N_{i}+16*N_{tf}}{N_{w}}$ , where $b_{i}^{*}$ is the calculated bit-width in the $i_{th}$ layer, $N_{i}$ is the number of weights of the $i_{th}$ layer, $N_{tf}$ is the number of parameters for pre-cached time features, and $N_{w}$ is the total number of weights in linear and convolutional layers. We calculate the model size by integrating all other parameters as 32 bits. The index and name of each layer are listed:

1 down_blocks.0.attentions.0.proj_in: 6
2 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q: 5
3 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k: 5
4 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v: 4
5 down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0: 6
6 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q: 5
7 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k: 7
8 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v: 3
9 down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
10 down_blocks.0.attentions.0.transformer_blocks.0.ff.net.0.proj: 3
11 down_blocks.0.attentions.0.transformer_blocks.0.ff.net.2: 3
12 down_blocks.0.attentions.0.proj_out: 5
13 down_blocks.0.attentions.1.proj_in: 4
14 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_q: 3
15 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_k: 4
16 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_v: 6
17 down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_out.0: 5
18 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_q: 5
19 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_k: 7
20 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_v: 2
21 down_blocks.0.attentions.1.transformer_blocks.0.attn2.to_out.0: 3
22 down_blocks.0.attentions.1.transformer_blocks.0.ff.net.0.proj: 3
23 down_blocks.0.attentions.1.transformer_blocks.0.ff.net.2: 3
24 down_blocks.0.attentions.1.proj_out: 6
25 down_blocks.0.resnets.0.conv1: 3
26 down_blocks.0.resnets.0.conv2: 3
27 down_blocks.0.resnets.1.conv1: 3
28 down_blocks.0.resnets.1.conv2: 4
29 down_blocks.0.downsamplers.0.conv: 4
30 down_blocks.1.attentions.0.proj_in: 4
31 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q: 3
32 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k: 3
33 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v: 4
34 down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0: 4
35 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q: 3
36 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k: 5
37 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v: 4
38 down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
39 down_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
40 down_blocks.1.attentions.0.transformer_blocks.0.ff.net.2: 2
41 down_blocks.1.attentions.0.proj_out: 4
42 down_blocks.1.attentions.1.proj_in: 4
43 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q: 2
44 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k: 2
45 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v: 4
46 down_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0: 4
47 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q: 3
48 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k: 6
49 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v: 4
50 down_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0: 3
51 down_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj: 2
52 down_blocks.1.attentions.1.transformer_blocks.0.ff.net.2: 2
53 down_blocks.1.attentions.1.proj_out: 4
54 down_blocks.1.resnets.0.conv1: 3
55 down_blocks.1.resnets.0.conv2: 3
56 down_blocks.1.resnets.0.conv_shortcut: 7
57 down_blocks.1.resnets.1.conv1: 3
58 down_blocks.1.resnets.1.conv2: 2
59 down_blocks.1.downsamplers.0.conv: 4
60 down_blocks.2.attentions.0.proj_in: 3
61 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_q: 3
62 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_k: 2
63 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_v: 3
64 down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
65 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q: 3
66 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k: 4
67 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v: 4
68 down_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
69 down_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
70 down_blocks.2.attentions.0.transformer_blocks.0.ff.net.2: 1
71 down_blocks.2.attentions.0.proj_out: 3
72 down_blocks.2.attentions.1.proj_in: 4
73 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_q: 4
74 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_k: 2
75 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_v: 3
76 down_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0: 3
77 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q: 3
78 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k: 4
79 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v: 4
80 down_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0: 3
81 down_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj: 2
82 down_blocks.2.attentions.1.transformer_blocks.0.ff.net.2: 2
83 down_blocks.2.attentions.1.proj_out: 4
84 down_blocks.2.resnets.0.conv1: 3
85 down_blocks.2.resnets.0.conv2: 2
86 down_blocks.2.resnets.0.conv_shortcut: 4
87 down_blocks.2.resnets.1.conv1: 2
88 down_blocks.2.resnets.1.conv2: 1
89 down_blocks.2.downsamplers.0.conv: 1
90 down_blocks.3.resnets.0.conv1: 1
91 down_blocks.3.resnets.0.conv2: 1
92 down_blocks.3.resnets.1.conv1: 1
93 down_blocks.3.resnets.1.conv2: 1
94 up_blocks.0.resnets.0.conv1: 1
95 up_blocks.0.resnets.0.conv2: 2
96 up_blocks.0.resnets.0.conv_shortcut: 1
97 up_blocks.0.resnets.1.conv1: 1
98 up_blocks.0.resnets.1.conv2: 1
99 up_blocks.0.resnets.1.conv_shortcut: 1
100 up_blocks.0.resnets.2.conv1: 2
101 up_blocks.0.resnets.2.conv2: 1
102 up_blocks.0.resnets.2.conv_shortcut: 1
103 up_blocks.0.upsamplers.0.conv: 1
104 up_blocks.1.attentions.0.proj_in: 3
105 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q: 2
106 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_k: 1
107 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_v: 2
108 up_blocks.1.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
109 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q: 3
110 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_k: 4
111 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_v: 4
112 up_blocks.1.attentions.0.transformer_blocks.0.attn2.to_out.0: 2
113 up_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
114 up_blocks.1.attentions.0.transformer_blocks.0.ff.net.2: 2
115 up_blocks.1.attentions.0.proj_out: 3
116 up_blocks.1.attentions.1.proj_in: 3
117 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_q: 2
118 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_k: 2
119 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_v: 2
120 up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0: 2
121 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_q: 2
122 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_k: 4
123 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_v: 3
124 up_blocks.1.attentions.1.transformer_blocks.0.attn2.to_out.0: 1
125 up_blocks.1.attentions.1.transformer_blocks.0.ff.net.0.proj: 1
126 up_blocks.1.attentions.1.transformer_blocks.0.ff.net.2: 1
127 up_blocks.1.attentions.1.proj_out: 3
128 up_blocks.1.attentions.2.proj_in: 3
129 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_q: 1
130 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_k: 1
131 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_v: 2
132 up_blocks.1.attentions.2.transformer_blocks.0.attn1.to_out.0: 2
133 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_q: 1
134 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_k: 3
135 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_v: 2
136 up_blocks.1.attentions.2.transformer_blocks.0.attn2.to_out.0: 1
137 up_blocks.1.attentions.2.transformer_blocks.0.ff.net.0.proj: 1
138 up_blocks.1.attentions.2.transformer_blocks.0.ff.net.2: 1
139 up_blocks.1.attentions.2.proj_out: 2
140 up_blocks.1.resnets.0.conv1: 1
141 up_blocks.1.resnets.0.conv2: 1
142 up_blocks.1.resnets.0.conv_shortcut: 3
143 up_blocks.1.resnets.1.conv1: 1
144 up_blocks.1.resnets.1.conv2: 1
145 up_blocks.1.resnets.1.conv_shortcut: 3
146 up_blocks.1.resnets.2.conv1: 1
147 up_blocks.1.resnets.2.conv2: 1
148 up_blocks.1.resnets.2.conv_shortcut: 3
149 up_blocks.1.upsamplers.0.conv: 2
150 up_blocks.2.attentions.0.proj_in: 4
151 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_q: 2
152 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_k: 2
153 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_v: 3
154 up_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
155 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_q: 1
156 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_k: 2
157 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_v: 1
158 up_blocks.2.attentions.0.transformer_blocks.0.attn2.to_out.0: 1
159 up_blocks.2.attentions.0.transformer_blocks.0.ff.net.0.proj: 1
160 up_blocks.2.attentions.0.transformer_blocks.0.ff.net.2: 1
161 up_blocks.2.attentions.0.proj_out: 3
162 up_blocks.2.attentions.1.proj_in: 4
163 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_q: 2
164 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_k: 3
165 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_v: 3
166 up_blocks.2.attentions.1.transformer_blocks.0.attn1.to_out.0: 3
167 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_q: 1
168 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_k: 3
169 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_v: 1
170 up_blocks.2.attentions.1.transformer_blocks.0.attn2.to_out.0: 1
171 up_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj: 1
172 up_blocks.2.attentions.1.transformer_blocks.0.ff.net.2: 1
173 up_blocks.2.attentions.1.proj_out: 3
174 up_blocks.2.attentions.2.proj_in: 4
175 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_q: 2
176 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_k: 2
177 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_v: 2
178 up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_out.0: 3
179 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_q: 2
180 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_k: 3
181 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_v: 1
182 up_blocks.2.attentions.2.transformer_blocks.0.attn2.to_out.0: 1
183 up_blocks.2.attentions.2.transformer_blocks.0.ff.net.0.proj: 1
184 up_blocks.2.attentions.2.transformer_blocks.0.ff.net.2: 1
185 up_blocks.2.attentions.2.proj_out: 3
186 up_blocks.2.resnets.0.conv1: 1
187 up_blocks.2.resnets.0.conv2: 2
188 up_blocks.2.resnets.0.conv_shortcut: 4
189 up_blocks.2.resnets.1.conv1: 1
190 up_blocks.2.resnets.1.conv2: 2
191 up_blocks.2.resnets.1.conv_shortcut: 4
192 up_blocks.2.resnets.2.conv1: 1
193 up_blocks.2.resnets.2.conv2: 1
194 up_blocks.2.resnets.2.conv_shortcut: 4
195 up_blocks.2.upsamplers.0.conv: 3
196 up_blocks.3.attentions.0.proj_in: 4
197 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_q: 2
198 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_k: 2
199 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_v: 6
200 up_blocks.3.attentions.0.transformer_blocks.0.attn1.to_out.0: 3
201 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_q: 2
202 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k: 3
203 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_v: 1
204 up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_out.0: 1
205 up_blocks.3.attentions.0.transformer_blocks.0.ff.net.0.proj: 1
206 up_blocks.3.attentions.0.transformer_blocks.0.ff.net.2: 1
207 up_blocks.3.attentions.0.proj_out: 4
208 up_blocks.3.attentions.1.proj_in: 4
209 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_q: 2
210 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_k: 3
211 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_v: 5
212 up_blocks.3.attentions.1.transformer_blocks.0.attn1.to_out.0: 3
213 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_q: 2
214 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_k: 3
215 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_v: 1
216 up_blocks.3.attentions.1.transformer_blocks.0.attn2.to_out.0: 1
217 up_blocks.3.attentions.1.transformer_blocks.0.ff.net.0.proj: 2
218 up_blocks.3.attentions.1.transformer_blocks.0.ff.net.2: 1
219 up_blocks.3.attentions.1.proj_out: 4
220 up_blocks.3.attentions.2.proj_in: 6
221 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q: 2
222 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k: 3
223 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v: 4
224 up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0: 3
225 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q: 4
226 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k: 5
227 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v: 1
228 up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0: 1
229 up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj: 3
230 up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2: 2
231 up_blocks.3.attentions.2.proj_out: 4
232 up_blocks.3.resnets.0.conv1: 1
233 up_blocks.3.resnets.0.conv2: 2
234 up_blocks.3.resnets.0.conv_shortcut: 4
235 up_blocks.3.resnets.1.conv1: 2
236 up_blocks.3.resnets.1.conv2: 2
237 up_blocks.3.resnets.1.conv_shortcut: 4
238 up_blocks.3.resnets.2.conv1: 2
239 up_blocks.3.resnets.2.conv2: 2
240 up_blocks.3.resnets.2.conv_shortcut: 4
241 mid_block.attentions.0.proj_in: 2
242 mid_block.attentions.0.transformer_blocks.0.attn1.to_q: 3
243 mid_block.attentions.0.transformer_blocks.0.attn1.to_k: 1
244 mid_block.attentions.0.transformer_blocks.0.attn1.to_v: 2
245 mid_block.attentions.0.transformer_blocks.0.attn1.to_out.0: 2
246 mid_block.attentions.0.transformer_blocks.0.attn2.to_q: 1
247 mid_block.attentions.0.transformer_blocks.0.attn2.to_k: 4
248 mid_block.attentions.0.transformer_blocks.0.attn2.to_v: 4
249 mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0: 3
250 mid_block.attentions.0.transformer_blocks.0.ff.net.0.proj: 2
251 mid_block.attentions.0.transformer_blocks.0.ff.net.2: 1
252 mid_block.attentions.0.proj_out: 3
253 mid_block.resnets.0.conv1: 1
254 mid_block.resnets.0.conv2: 1
255 mid_block.resnets.1.conv1: 1
256 mid_block.resnets.1.conv2: 1
conv_in: 8
conv_out: 8

Appendix I Details for Evaluation Metrics

In Sec. 5, we measure the performance on various metrics such as TIFA, GenEval, CLIP score and FID. Here, we provide more details for these metrics.

TIFA Score. TIFA v1.0 [26] aims to measure the faithfulness of generated images. It includes various 4K text prompts sampled from the MS-COCO captions [44], DrawBench [64], PartiPrompts [77], and PaintSkill [6], associated with a pre-generated set of question-answer pairs, resulting in 25K questions covering 4.5K diverse elements. Image faithfulness is measured by determining if the VQA model can accurately answer the questions from the generated images.

GenEval Score. GenEval [14] measures the consistency between the generated images and the description, including 6 different tasks: single object, two object, counting, colors, position, color attribution. All text prompts are generated from task-specific templates filled in with: randomly sampled object names from MS-COCO [44], colors from Berlin-Kay basic color theory, numbers with 2, 3, 4, and relative positions from "above", "below", "to the left of", or "to the right of". We adopt the pre-trained object detection model Mask2Former (Swin-S-8 $\times$ 2) [5] for evaluation.

CLIP score and FID. CLIP score measures measure the similarity between text prompts and corresponding generated images. FID is used to evaluate the quality of generated images by measuring the distance between the distributions of features extracted from generated images and target images. In the main experiments, evaluation are measured based on MS-COCO 2014 validation set with 30K image-caption pairs [44]. We adopt ViT-B/32 model to evaluate the CLIP score in our experiments.

Appendix J Human Evaluation

In Sec. 5, we provide the human evaluation results. Here, we provide more detailed human evaluation with category and challenge comparisons on PartiPrompts (P2), comparing Stable Diffusion v1.5 and BitsFusion, with the question: Given a prompt, which image has better aesthetics and image-text alignment? Our model is selected 888 times out of 1632 comparisons, indicating a general preference over SD-v1.5, which is chosen 744 times, demonstrating more appealing and accurate generated images.

J.1 Analysis on Categories

Illustrations, People, and Arts. Our model significantly outperforms SD-v1.5 in generating illustrations (77 wins out of 124), images of people (101 out of 174), and arts (45 out of 65).

Outdoor and Indoor Scenes. Our model also shows strength in generating both outdoor (73 out of 131) and indoor scenes (23 out of 40), suggesting better environmental rendering capabilities.

J.2 Analysis on Challenges

Complex and Fine-grained Detail: Our model excels in generating images with complex details (73 out of 113) and fine-grained details (173 out of 312), suggesting advanced capabilities in maintaining detail at varying complexity levels.

Imagination and Style & Format: Our model also shows a strong performance in tasks requiring imaginative (92 out of 149) and stylistic diversity (118 out of 204), highlighting its flexibility and creative handling of artistic elements.

The strong performance in imaginative and artistic categories presents an opportunity to target applications in creative industries, such as digital art and entertainment, where these capabilities can be particularly valuable.

Appendix K Evaluation on Different Schedulers

In the main experiments in Sec. 5, we leverage the PNDM scheduler to generate images. Here, we measured the performance of different schedulers, such as DDIM [70] and DPMSolver [50], to demonstrate the generality and effectiveness of BitsFusion. We set 50 inference steps and fix the random seed as 1024. As shown in Fig. 15, BitsFusionconsistently outperforms SD-v1.5 with different schedulers.

Appendix L FID Results and Discussion

As stated in SDXL [55] and PickScore [35], FID may not honestly reflect the actual performance of the model in practice. FID measures the average distance between generated images and reference real images, which is largely influenced by the training datasets. Also, FID does not capture the human preference which is the crucial metric for evaluating text-to-image synthesis. Nevertheless, we present FID results evaluated on the 30K MS-COCO 2014 validation set in Fig. 16. Our Stage-I model has a similar FID as SD-v1.5. However, as training progresses, although our Stage-II model is preferred by users, its FID score is higher than both Stage-I and SD-v1.5.

Appendix M Detailed GenEval Results

In Sec. 5, we provide the overall GenEval results. Here, we provide detailed GenEval results for further comparisons as illustrated in Tab. 8.

Table 8: Detailed GenEval with different CFG scales.

Method	Overall	Single	Two	Counting	Colors	Position	Color Attribution
Guidance Scale = 2.5
SD-v1.5	0.3589	0.9350	0.2626	0.2775	0.6043	0.0340	0.0400
Ours-I	0.3353	0.9075	0.2444	0.2550	0.5426	0.0280	0.0340
Ours-II	0.4024	0.8975	0.3859	0.2750	0.6979	0.0560	0.1020
Guidance Scale = 3.5
SD-v1.5	0.3879	0.9400	0.3010	0.3275	0.6787	0.0300	0.0500
Ours-I	0.3650	0.9500	0.2808	0.2575	0.6277	0.0280	0.0460
Ours-II	0.4370	0.9350	0.4727	0.3125	0.7340	0.0600	0.1080
Guidance Scale = 4.5
SD-v1.5	0.4056	0.9700	0.3010	0.3200	0.7426	0.0340	0.0660
Ours-I	0.3851	0.9500	0.3091	0.3100	0.6574	0.0340	0.0500
Ours-II	0.4516	0.9575	0.4788	0.3450	0.7723	0.0520	0.1040
Guidance Scale = 5.5
SD-v1.5	0.4094	0.9750	0.3111	0.3325	0.7319	0.0400	0.0660
Ours-I	0.4039	0.9675	0.3232	0.3425	0.7000	0.0300	0.0600
Ours-II	0.4567	0.9600	0.4909	0.3175	0.7979	0.0540	0.1200
Guidance Scale = 6.5
SD-v1.5	0.4224	0.9800	0.3293	0.3725	0.7447	0.0400	0.0680
Ours-I	0.4161	0.9675	0.3414	0.3350	0.7425	0.0360	0.0740
Ours-II	0.4612	0.9750	0.4990	0.3275	0.7957	0.0540	0.1160
Guidance Scale = 7.5
SD-v1.5	0.4262	0.9775	0.3313	0.3850	0.7596	0.0440	0.0600
Ours-I	0.4226	0.9775	0.3495	0.3600	0.7447	0.0360	0.0680
Ours-II	0.4682	0.9800	0.5091	0.3300	0.8085	0.0680	0.1140
Guidance Scale = 8.5
SD-v1.5	0.4271	0.9825	0.3273	0.3925	0.7745	0.0320	0.0540
Ours-I	0.4269	0.9800	0.3616	0.3475	0.7702	0.0400	0.0620
Ours-II	0.4747	0.9700	0.5111	0.3675	0.8213	0.0620	0.1160
Guidance Scale = 9.5
SD-v1.5	0.4260	0.9825	0.3556	0.3825	0.7553	0.0280	0.0520
Ours-I	0.4190	0.9825	0.3495	0.3450	0.7447	0.0300	0.0620
Ours-II	0.4736	0.9700	0.5192	0.3625	0.8277	0.0560	0.1060

Appendix N More Comparisons

We provide the prompts for the images featured in the Fig. 1. Additionally, we provide more generated images for the comparison.

N.1 Prompts

Prompts of Fig. 1 from left to right are:

1. a portrait of an anthropomorphic cyberpunk raccoon smoking a cigar, cyberpunk!,
fantasy, elegant, digital painting, artstation, concept art, matte, sharp focus,
illustration, art by josan Gonzalez

2. Pirate ship trapped in a cosmic maelstrom nebula, rendered in cosmic beach
whirlpool engine, volumetric lighting, spectacular, ambient lights,
light pollution, cinematic atmosphere, art nouveau style,
illustration art artwork by SenseiJaye, intricate detail.

3. tropical island, 8 k, high resolution, detailed charcoal drawing,
beautiful hd, art nouveau, concept art, colourful, in the style of vadym meller

4. anthropomorphic art of a fox wearing a white suit, white cowboy hat,
and sunglasses, smoking a cigar, texas inspired clothing by artgerm,
victo ngai, ryohei hase, artstation. highly detailed digital painting,
smooth, global illumination, fantasy art by greg rutkowsky, karl spitzweg

5. a painting of a lantina elder woman by Leonardo da Vinci . details, smooth,
sharp focus, illustration, realistic, cinematic, artstation, award winning, rgb ,
unreal engine, octane render, cinematic light, macro, depth of field, blur,
red light and clouds from the back, highly detailed epic cinematic concept art CG
render made in Maya, Blender and Photoshop, octane render, excellent composition,
dynamic dramatic cinematic lighting, aesthetic, very inspirational, arthouse.

6. panda mad scientist mixing sparkling chemicals, high-contrast painting

7. An astronaut riding a horse on the moon, oil painting by Van Gogh.

8. A red dragon dressed in a tuxedo and playing chess. The chess pieces are
fashioned after robots.

N.2 Additional Image Comparisons

We provide more images for further comparisons. For each set of two rows, the top row displays images generated using the full-precision Stable Diffusion v1.5, while the bottom row features images generated from BitsFusion, where the weights of UNet are quantized into 1.99 bits and the model size is $7.9\times$ smaller than the one from SD-v1.5. All the images are synthesized under the setting of using PNDM sampler with $50$ sampling steps and random seed as $1024$ .

$\displaystyle E$	$\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\\|\mathbf{\hat{z}}_{\boldsymbol{% \theta}_{\texttt{fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\mathbf{\hat{z}}_{% \boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_{t},\mathbf{c})\\|^% {2}\right],$	(9)
	$\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\left\\|\left(\frac{1}{\sqrt{\bar{% \alpha}_{t}}}\mathbf{z}_{t}-\frac{\sqrt{1-\bar{\alpha}_{t}}}{\sqrt{\bar{\alpha% }_{t}}}\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{fp}}}(t,% \mathbf{z}_{t},\mathbf{c})\right)-\left(\frac{1}{\sqrt{\bar{\alpha}_{t}}}% \mathbf{z}_{t}-\frac{\sqrt{1-\bar{\alpha}_{t}}}{\sqrt{\bar{\alpha}_{t}}}% \boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{int},\mathbf{s}}}(t,% \mathbf{z}_{t},\mathbf{c})\right)\right\\|^{2}\right],$
	$\displaystyle=\mathbb{E}_{t,\mathbf{x}}\left[\frac{1-\bar{\alpha}_{t}}{\bar{% \alpha}_{t}}\left\\|\boldsymbol{\hat{\epsilon}}_{\boldsymbol{\theta}_{\texttt{% fp}}}(t,\mathbf{z}_{t},\mathbf{c})-\boldsymbol{\hat{\epsilon}}_{\boldsymbol{% \theta}_{\texttt{int},\mathbf{s}}}(t,\mathbf{z}_{t},\mathbf{c})\right\\|^{2}% \right],$


(a) DDIM		(b) DPMSolver