DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis
Abstract
Diffusion models have achieved great success in image generation, with the backbone
evolving from U-Net to Vision Transformers. However, the computational cost of
Transformers is quadratic in the number of tokens, leading to significant challenges
when dealing with high-resolution images. In this work, we propose Diffusion Mamba
(DiM), which combines the efficiency of Mamba, a sequence model based on State Space
Models (SSM), with the expressive power of diffusion models for efficient
high-resolution image synthesis. Because Mamba is designed for 1D sequences and does
not directly generalize to 2D signals, we introduce several architectural designs,
including multi-directional scans, learnable padding tokens at the end of each row
and column, and lightweight local feature enhancement. Our DiM architecture achieves
inference-time efficiency for high-resolution images. In addition, to further
improve training efficiency for high-resolution image generation with DiM, we
investigate a “weak-to-strong” training strategy that pretrains DiM on low-resolution
images (256 × 256) and then fine-tunes it on high-resolution images (512 × 512). We
further explore training-free upsampling strategies to enable the model to generate
higher-resolution images (e.g., 1024 × 1024 and
1536 × 1536) without further fine-tuning. Experiments demonstrate the effectiveness
and efficiency of our DiM. The code of our work is available here:
https://github.com/tyshiwo1/DiM-DiffusionMamba/.
1 Introduction
Diffusion models have achieved great success in image generation [1–4]. The backbones of
diffusion models have evolved from Convolutional Neural Networks, represented by U-Net [5], to
Vision Transformers [6–10] because of the effectiveness and scalability of transformer architectures.
Transformer-based diffusion models encode images into latent feature maps, patchify the latent
feature maps, project patches to tokens, and then apply transformers to denoise the image tokens.
However, the complexity of self-attention layers in transformers is quadratic in the number of tokens,
leading to substantial challenges in terms of computation cost for high-resolution generation.
Recently, Mamba [11], a sequence model backbone based on State Space Models (SSM) [12], has
shown remarkable effectiveness and efficiency across several modalities such as language, audio,
and genomics [11, 13]. Mamba achieves performance comparable to Transformers with better
inference-time efficiency. In particular, Mamba shows great promise in long sequence modeling,
because the computational complexity of Mamba scales linearly with the number of tokens, compared
to quadratic scaling for Transformers. Those properties of Mamba motivate us to introduce Mamba
as a new backbone into diffusion models, particularly for efficient high-resolution image generation.
However, several challenges arise when incorporating Mamba with diffusion models for high-
resolution image generation. The primary challenge stems from the mismatch between the causal
sequential modeling of Mamba and the two-dimensional (2D) data structure of images. Mamba is
designed for one-dimensional (1D) causal modeling for sequential signals, which cannot be directly
leveraged for modeling two-dimensional image tokens. A simple solution is to use the raster-scan
order to convert 2D data into 1D sequences. However, it restricts the receptive field of each location
to only the previous locations in the raster-scan order. Moreover, in the raster-scan order, the ending
of the current row is followed by the beginning of the next row, while they do not share spatial
continuity. The second challenge is that, despite the advantage of Mamba for efficient inference, training
Mamba-based diffusion models on high-resolution images is still costly.
To mitigate the first challenge, we propose Diffusion Mamba (DiM) (shown in Fig. 2), a Mamba-based
diffusion model backbone for efficient high-resolution image generation. In our framework, we follow
transformer-based diffusion models to encode and patchify images into patch features. Then, we
leverage Mamba architecture as the diffusion backbone to model the patch features. In order to avoid
unidirectional causal relationships among patches and to endow each token with a global receptive
field, we design the Mamba blocks to alternately execute four scanning directions. Moreover, we
insert learnable padding tokens between the two tokens which are adjacent in the scanning order but
not adjacent in the spatial domain, so as to allow the Mamba blocks to discern the image boundaries
and to avoid misleading the sequence model. We also add 3 × 3 depth-wise convolution layers to
the input layer and output layer of the network to enhance the local coherence of generated images.
Additionally, we add long skip connections [5, 7] between shallow and deep layers to propagate the
low-level information to the high-level features, which is proven to be beneficial for the pixel-level
prediction objective in diffusion models.
To tackle the challenge of training efficiency for high-resolution image generation with DiM, we
explore resource-efficient approaches to adapt our DiM model pretrained on low-resolution images
for high-resolution image generation. We first observe empirical evidence that DiM pretrained on
low-resolution images provides a reasonable prior for high-resolution image generation. We therefore explore
the “weak-to-strong” training strategy [14], where we first train DiM on low-resolution images and
then use the pretrained model as initialization to efficiently fine-tune on high-resolution images. This
strategy largely reduces the training time cost for high-resolution image generation. We also explore
training-free upsampling approaches to adapt DiM to generate higher-resolution images without
further finetuning.
We conduct experiments on CIFAR-10 [15] and ImageNet [16]. On CIFAR-10, DiM-Small can
achieve an FID score of 2.92, and we also perform ablation studies to demonstrate the effectiveness
of our designs. On ImageNet, we pretrain our DiM-Huge at a resolution of 256 × 256, and then
fine-tune DiM-Huge efficiently at a higher resolution of 512 × 512. Despite being trained with far
fewer iterations, our DiM-Huge achieves performance comparable to other transformer-based and
SSM-based diffusion models, demonstrating the effectiveness and training efficiency of our approach.
Moreover, with training-free upsampling schemes, our DiM fine-tuned on 512 × 512 images can
further generate 1024 × 1024 and 1536 × 1536 images. We analyze the inference time to demonstrate
the efficiency of DiM for high-resolution image synthesis.
In summary, our contributions lie in the following aspects:
• We propose a new Mamba-based diffusion model, DiM, for efficient high-resolution image
generation. We propose several effective designs to endow Mamba, which was designed for
processing 1-D signals, with the ability to model 2-D images.
• To address the high cost of training on high-resolution images, we investigate strategies to
fine-tune DiM pretrained on low-resolution images for high-resolution image generation.
Moreover, we explore training-free upsampling schemes to adapt the model to generate
higher-resolution images without further fine-tuning.
• The experiments on ImageNet and CIFAR demonstrate the training efficiency, inference
efficiency, and effectiveness of our DiM in high-resolution image generation.
2 Related Work
Backbones of Diffusion Models. Traditionally, U-Net [5] serves as a backbone for diffusion
models [1, 4, 17]. This architecture is characterized by its down-sampling and up-sampling blocks,
connected by long skip connections. Each block of this U-Net is composed of convolutional layers
and attention modules [18]. However, the scalability of this architecture has not been successfully
demonstrated. Recently, transformer-based diffusion models have been proposed [6, 8, 9, 7, 19].
Different from the diffusion models based on U-Net, the architecture of these transformer-based
models is built exclusively on attention modules and multi-layer perceptrons (MLPs), without any
convolutional layers. These models have shown remarkable scalability in the generative tasks of
computer vision [19, 20, 10]. Notably, Sora [10] exemplifies the great scalability of transformers in
generating high-quality videos. To effectively train these diffusion models with scaled-up backbones,
many works suggest performing the training process at lower resolutions and then fine-tuning the
pre-trained models at higher resolutions [9, 8, 21]. While the transformer excels in general image
generation, its limited efficiency in processing large quantities of tokens hinders the progression of
high-resolution image generation.
State Space Models. State Space Models (SSMs) [12, 22–24] were proposed for sequential modeling and
have been applied in control theory, signal processing, and natural language processing [12]. The
inputs and outputs of SSMs are one-dimensional sequences. In an SSM, the tokens of the input
sequence are recurrently mapped to hidden states through a linear transformation (addition and
element-wise multiplication) involving the preceding hidden state and the model weights. The
outputs are likewise obtained from the hidden states via a linear transformation with other model weights.
Mamba [11] is a new type of state space model where each block contains a selective scan module, a
1D causal convolution, and a normalization layer. The process of the selective scan is similar to that
in SSMs, but it implements the function of data selection via input-dependent model weights [25]. To
accelerate the training and inference of Mamba, this selective scan is implemented via a work-efficient
parallel scanning algorithm [26] in SRAM. In the following paragraph, we will delve into the existing
state space models used for image generation and provide a detailed comparison with our method.
Figure 2: Overview. The inputs of our framework are a noisy image/latent, together with a timestep and a
class condition. The noisy inputs are transformed into patch-wise features, processed by a depth-wise
convolution, and appended with time, class, and padding tokens. The features are flattened and
scanned by Mamba blocks in four directions. The features are then transformed into 2D patches,
processed by another convolution, and finally used for noise prediction.
[Diagram labels: Patchify, DWConv3x3, Pad & Flatten, Mamba Block, Scan Switch, Skip, Unpad & Unflatten,
Unpatchify, Prediction; scan patterns ①②③④; time (t), class (c), and pad (p) tokens.]
State Space Models in Vision. The state space models have already been applied in computer vision
even prior to Mamba. DiffuSSM [27] is the first diffusion model replacing attention mechanisms
with state space models. Recently, Mamba has been proposed for better modeling power compared
to the preceding state space models. Various Mamba variants are then proposed for vision tasks
on image and video inputs [28–31]. Also, several concurrent works introduce Mamba into image
generation. DiS [32] directly incorporates ViM [28], a variant of Mamba, into image generation,
exploring its generative capabilities up to a maximum of 512 × 512 resolution images. ZigMa [33]
utilizes the vanilla Mamba blocks with various scan patterns and is trained on high-resolution human
face generation datasets [34]. Distinct from the above methods, we propose the first Mamba-based
diffusion model which can generate images with more than 10K tokens, unleashing the power of
Mamba on long sequence processing. Our model also contains several new modules for spatial prior
and achieves comparable performance to the transformer-based diffusion models on the widely used
benchmarks. Ours is also the first framework to validate the fine-tuning ability of Mamba-based
diffusion models at various resolutions.
3 Method
We introduce the preliminaries of State Space Models and Mamba in Sec. 3.1. We then introduce the
network architecture and several designs of our Mamba-based diffusion model (DiM) in Sec. 3.2.
These designs adapt Mamba, which was designed for processing 1D signals, to 2D
image generation and enable efficient inference for high-resolution image generation. In Sec. 3.3,
we investigate the fine-tuning and training-free strategies to improve the training efficiency of DiM
for high-resolution image generation.
State Space Models (SSM) [12, 22, 24, 23] are designed to encode and decode one-dimensional
sequential inputs. In a continuous-time SSM, an input signal x(i) is first encoded into a hidden
state vector h(i) and then decoded into the output signal y(i) according to the following ordinary
differential equations (ODEs):
\[ h'(i) = A\,h(i) + B\,x(i), \qquad y(i) = C\,h(i) + D\,x(i), \tag{1} \]
where h′ denotes the derivative of h, and A, B, C, D denote the weights of SSMs. Typically, the
inputs in natural languages and two-dimensional vision are discrete signals, so Mamba leverages the
zero-order hold (ZOH) rule for discretization. Thus, the above ODEs can be recurrently solved:
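\[ \bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right) \cdot \Delta B, \tag{2} \]
\[ h_i = \bar{A}\,h_{i-1} + \bar{B}\,x_i, \qquad y_i = C\,h_i + D\,x_i, \tag{3} \]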
where ∆ is also a model parameter. Recently, Mamba [11] improves the flexibility of SSMs
by making the time-invariant parameters time-varying. This modification replaces
the static model weights (B, C, ∆) with dynamic weights [25] that depend on the input x. The
resulting process with input-dependent parameters is termed the selective scan.
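To make Eqs. (2) and (3) concrete, the following is a minimal sketch of zero-order-hold discretization followed by the recurrent scan. It assumes a scalar (diagonal, single-channel) state so that the matrix inverse and exponential reduce to scalar operations; the function names `zoh_discretize` and `ssm_scan` are illustrative and not taken from the Mamba codebase.

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t), scalar a != 0.

    Scalar form of Eq. (2):
    a_bar = exp(delta*a), b_bar = (delta*a)^-1 * (exp(delta*a) - 1) * delta*b.
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)
    return a_bar, b_bar

def ssm_scan(xs, a, b, c, d, delta, h0=0.0):
    """Run the recurrence of Eq. (3) over a 1D sequence and return the outputs."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h, ys = h0, []
    for x in xs:
        h = a_bar * h + b_bar * x   # h_i = A_bar * h_{i-1} + B_bar * x_i
        ys.append(c * h + d * x)    # y_i = C * h_i + D * x_i
    return ys

# Constant input: the discrete states should track the exact solution 1 - exp(-t).
ys = ssm_scan([1.0] * 5, a=-1.0, b=1.0, c=1.0, d=0.0, delta=0.1)
```

In this sketch the parameters are fixed for the whole sequence. In a selective scan, `b`, `c`, and `delta` would instead be recomputed from each input token before the update.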
Mamba is primarily designed for processing one-dimensional inputs, so it is difficult for Mamba to
learn the two-dimensional data structure of images without any modification. Therefore, we propose
several new architectural designs that enable DiM to handle spatial structure.
Overall architecture. As depicted in Fig. 2, our framework processes a noisy two-dimensional (2D)
input, such as an image or latent features [4], with a timestep and a class condition. This noisy input
can be deemed as a clean signal perturbed by a certain level of Gaussian noise corresponding to the
input timestep. The noisy input is first split into 2D patches, and each patch is transformed into a
high-dimensional feature vector by a fully-connected layer. Next, these patches are fed into a 3 × 3
depth-wise convolution layer, where the local information is injected into the patches. The patches
are also padded with learnable tokens at the end of rows and columns, allowing the model to be aware
of the 2D spatial structure during the 1-D sequential scanning. Then, the patch tokens are flattened
into a patch sequence, using one of the four scan patterns illustrated in Fig. 2. The timestep and class
condition are also transformed into tokens by fully-connected layers, and are then appended to the
sequence [7]. Subsequently, the sequence is fed into the Mamba blocks for scanning. Additionally,
we add long skip connections [5, 7] between shallow and deep layers to propagate the low-level
information to the high-level features, which is proven to be beneficial for the pixel-level prediction
objective in diffusion models. We illustrate several design choices in the following paragraphs.
Scan patterns. A global receptive field is crucial for our model to efficiently capture the spatial
structure within images. Scanning the image patches in a single raster-scan direction leads to
unidirectional and limited receptive fields. For example, the first scanned patch, at the top-left
corner, would never aggregate information from any other patch. To give each patch a
global receptive field, we adopt different scanning patterns in different model blocks. Specifically, as
shown in Fig. 2, in the first block we adopt the row-major scan, i.e., we scan the sequence of image
patches row by row, with each row scanned horizontally from left to right before moving to
the next row. In the second block, we reverse the sequence order and scan the sequence in the same
manner. In the next two blocks, we perform column-major scans in the forward and reverse
order, respectively. After traversing all four scan patterns, we loop over them again in the following model blocks.
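The four scan patterns and their cycling across blocks can be sketched as index permutations over the patch grid. This is an illustrative sketch only: the function names and the numbering of patterns 0 to 3 (standing for ① to ④) are assumptions, not the authors' implementation.

```python
def scan_order(height, width, pattern):
    """Return patch ids (row-major numbering) in the visiting order of one pattern.

    pattern 0: row-major, 1: reversed row-major,
    pattern 2: column-major, 3: reversed column-major.
    """
    if pattern in (0, 1):
        order = [r * width + c for r in range(height) for c in range(width)]
    else:
        order = [r * width + c for c in range(width) for r in range(height)]
    return order[::-1] if pattern in (1, 3) else order

def pattern_for_block(block_index, num_patterns=4):
    """Cycle through the scan patterns across consecutive blocks."""
    return block_index % num_patterns

# Visiting orders used by the first five blocks on a 2 x 3 patch grid.
orders = [scan_order(2, 3, pattern_for_block(k)) for k in range(5)]
```

Because each pattern is a permutation of the same patch ids, features can be gathered into scan order before a block and scattered back afterwards without changing the block itself.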
Learnable padding token. The learning of spatial structure of images may be disrupted by the raster
scan. To be specific, when we flatten an image into a patch sequence, the right-most patch in one
row of the image becomes adjacent to the left-most patch of the next row. However, the contents
represented by these two feature vectors may vastly differ. This contradicts the inherent continuity
and spatial structure of images, thereby hindering the learning process. To mitigate this issue, we
allow the model to be aware of the end-of-line (EOL) by appending learnable padding tokens at the
end of each row or each column.
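As a rough illustration of the end-of-line padding, the sketch below flattens a patch grid and appends a marker after every scanned row or column. The constant `PAD` is a hypothetical placeholder for the learnable padding token, and the exact placement used in DiM may differ from this simplification.

```python
PAD = -1  # hypothetical marker standing in for the learnable padding token

def flatten_with_padding(height, width, column_major=False):
    """Flatten a height x width patch grid into a 1D order, appending PAD
    after every scanned line (a row, or a column when column_major is True)."""
    outer, inner = (width, height) if column_major else (height, width)
    sequence = []
    for o in range(outer):
        for i in range(inner):
            r, c = (i, o) if column_major else (o, i)
            sequence.append(r * width + c)
        sequence.append(PAD)  # end-of-line marker between spatially distant patches
    return sequence

row_seq = flatten_with_padding(2, 3)
col_seq = flatten_with_padding(2, 3, column_major=True)
```

With the marker in place, the right-most patch of one row and the left-most patch of the next are no longer direct neighbours in the sequence.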
Lightweight local feature enhancement. The local structure of images is disrupted by the flattening
of tokens for the scan. For example, in the row-major scan, the patch at row i and column j is no
longer adjacent to the patch at row (i + 1) and column j. Furthermore, since Mamba has been
designed for extreme efficiency, we opt to enhance the local structure by adding a few lightweight
modules at the beginning and the end of the network, instead of altering the Mamba blocks. To be
specific, we introduce two 3 × 3 depth-wise convolution layers. One convolution layer is inserted
after the patchify layer before feeding the tokens into Mamba blocks. Another convolution layer
is inserted after all the Mamba blocks, before the unpatchify and output layer. Those lightweight
depth-wise convolution layers provide DiM with awareness of the 2D local continuity.
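For readers unfamiliar with depth-wise convolution, the following is a dependency-free sketch of a 3 × 3 depth-wise layer with zero padding and stride 1. It is written with nested lists for clarity rather than speed, and the kernel values are arbitrary examples rather than learned weights.

```python
def depthwise_conv3x3(feature_map, kernels):
    """Apply one 3x3 kernel per channel with zero padding and stride 1.

    feature_map: list of channels, each a list of rows (C x H x W).
    kernels: list of C kernels, each 3x3. Channels are never mixed, which is
    what makes the convolution depth-wise (9 weights per channel).
    As in deep-learning frameworks, this is cross-correlation (no kernel flip).
    """
    output = []
    for channel, kernel in zip(feature_map, kernels):
        height, width = len(channel), len(channel[0])
        out_channel = [[0.0] * width for _ in range(height)]
        for r in range(height):
            for c in range(width):
                total = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            total += kernel[dr + 1][dc + 1] * channel[rr][cc]
                out_channel[r][c] = total
        output.append(out_channel)
    return output

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # leaves a channel unchanged
box = [[1 / 9] * 3 for _ in range(3)]           # averages in-bounds neighbours (sum / 9)
x = [[[1.0, 2.0], [3.0, 4.0]], [[9.0, 9.0], [9.0, 9.0]]]
y = depthwise_conv3x3(x, [identity, box])
```

Each output location mixes only its 3 × 3 spatial neighbourhood within one channel, so the layer restores the vertical adjacency that a row-major flattening breaks, at a cost of nine weights per channel.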
Despite its inference-time efficiency, training DiM on high-resolution images requires a lot of time
and computational resources. In this subsection, we investigate the strategies to improve the training
efficiency of DiM for high-resolution images.
“Weak-to-strong” training and fine-tuning. Training a diffusion model from scratch for high-
resolution images requires a lot of time and computational resources. We observe that DiM pretrained
on low-resolution images can provide a rough initialization for high-resolution training, shown
in Fig. 6. Therefore, we consider a “weak-to-strong” training strategy [14] where we pretrain our
model from scratch on low-resolution images and then perform fine-tuning on higher resolutions.
During fine-tuning, we upscale the length and width of images by a factor of 2. This strategy largely
reduces the computational cost for training high-resolution image generators with DiM.
Training-free upsampling. Extremely high-resolution images with annotations are not easy to obtain,
making it difficult to fine-tune DiM at even higher resolutions. Therefore, we explore the training-free
resolution upsampling capability of our model. For example, we directly use our model trained
on 512 × 512 dataset [16] to generate 1024 × 1024 images. However, performing training-free
super-resolution image generation on our model is non-trivial. We observe that directly feeding our
network with a higher-resolution Gaussian noise results in images with repetitive patterns, corrupted
global structures, and collapsed spatial layouts. Only the local structure and details exhibit relatively
good quality. In order to generate better global structures, we utilize the upsample-guidance [35] at
the early diffusion timesteps (e.g., the first 30% of timesteps):
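\[ \tilde{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) + \omega_t\, U\!\left[ \frac{1}{m}\, \epsilon_\theta\!\left( \frac{1}{\sqrt{P_t}}\, D[x_t],\ \tau \right) - D\!\left[ \epsilon_\theta(x_t, t) \right] \right], \tag{4} \]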
where ϵθ denotes our noise-prediction Mamba-based model, m denotes the upscaling factor (e.g.,
m = 2), U denotes the nearest-neighbor upsampling operator with scale m, D denotes the average pooling
operator (down-sampling) with stride m, xt denotes a noisy input, t denotes an input diffusion timestep,
τ denotes the timestep whose signal-to-noise ratio is m^2 times the signal-to-noise ratio of t, Pt
denotes a coefficient to calibrate the overall power of the predicted noise at each timestep, and ωt
denotes the weight for upsample-guidance. In the later diffusion timesteps, we directly feed the
higher-resolution noisy inputs into DiM for noise prediction.
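The terms of Eq. (4) can be sketched on a single-channel 2D input as follows. This is a toy illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the trained noise predictor, `find_tau` assumes that τ is chosen by a nearest-match lookup on a discrete SNR table, and the values of `p_t` and `w_t` are arbitrary.

```python
import math

def avg_pool(x, m):
    """D: average pooling with stride m on a 2D list (sides divisible by m)."""
    h, w = len(x) // m, len(x[0]) // m
    return [[sum(x[r * m + i][c * m + j] for i in range(m) for j in range(m)) / (m * m)
             for c in range(w)] for r in range(h)]

def nearest_upsample(x, m):
    """U: nearest-neighbour upsampling by an integer factor m."""
    return [[x[r // m][c // m] for c in range(len(x[0]) * m)]
            for r in range(len(x) * m)]

def find_tau(snr, t, m):
    """Timestep whose SNR is closest to m^2 * snr[t] (nearest match is an assumption)."""
    target = m * m * snr[t]
    return min(range(len(snr)), key=lambda k: abs(snr[k] - target))

def upsample_guided_eps(eps_model, x_t, t, tau, m, p_t, w_t):
    """Combine the terms of Eq. (4): eps + w_t * U[(1/m) eps(D[x]/sqrt(P_t), tau) - D[eps]]."""
    eps_hi = eps_model(x_t, t)
    x_lo = [[v / math.sqrt(p_t) for v in row] for row in avg_pool(x_t, m)]
    eps_lo = eps_model(x_lo, tau)
    pooled = avg_pool(eps_hi, m)
    diff = [[e / m - d for e, d in zip(er, dr)] for er, dr in zip(eps_lo, pooled)]
    guide = nearest_upsample(diff, m)
    return [[e + w_t * g for e, g in zip(er, gr)] for er, gr in zip(eps_hi, guide)]

def toy_model(x, t):
    """Hypothetical stand-in for the noise predictor: returns its input unchanged."""
    return x

x_t = [[1.0] * 4 for _ in range(4)]
tau = find_tau([8.0, 4.0, 2.0, 1.0, 0.5], t=3, m=2)
guided = upsample_guided_eps(toy_model, x_t, t=3, tau=tau, m=2, p_t=1.0, w_t=1.0)
```

Setting `w_t` to zero recovers the plain prediction, which corresponds to the later timesteps where guidance is switched off.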
4 Experiments
4.1 Experimental Setup
Model configuration. Following the existing settings [7, 32, 33], we present three versions of our
framework with different model sizes in Tab. 1, where the Gflops is calculated with a batch size
of 1, and the input size is set as 32 × 32 without image autoencoder [36]. As for the hyperparameters
of Mamba blocks, we follow the standard settings [11]. Following the traditional settings [7, 6],
we set the patch size as 2 × 2 for DiM trained on ImageNet and CIFAR.
Table 1: The configuration of our model.
Model      Params  Blocks  Hidden dim  Gflops
Small (S)  50M     25      512         12
Large (L)  380M    49      1024        94
Huge (H)   860M    49      1536        210
Implementation details. Every training experiment is performed on 8 A100-80G GPUs. Following
previous work [7], we use the same DDPM [1] scheduler, pretrained image autoencoder [36], and
DPM-Solver [37]. We use random flip as the data augmentation. The learning rate is set to 2 × 10^{-4}.
We also use EMA with a rate of 0.9999.
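For reference, an exponential moving average of the weights with this rate can be sketched as below. This assumes the standard per-step EMA update; the text does not state whether warm-up or bias correction is used, and the flat list of floats is a stand-in for real model parameters.

```python
def ema_update(ema_params, params, rate=0.9999):
    """One EMA step: ema <- rate * ema + (1 - rate) * param, element-wise."""
    return [rate * e + (1.0 - rate) * p for e, p in zip(ema_params, params)]

# Toy parameters: the EMA copy drifts slowly towards the current weights.
ema = [0.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 1.0])
```

With a rate of 0.9999, the average has an effective horizon on the order of 10,000 steps, which is why the EMA weights change slowly relative to the raw weights.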
Datasets and evaluation metrics. We use FID-50K [38] as the evaluation metric on all datasets.
The specific settings for each dataset are as follows: (1) CIFAR: The model is trained for unconditional
image generation with a batch size of 128. (2) ImageNet: We train the model for conditional image
generation. We also use classifier-free guidance for evaluation, and the guidance weight for calculating
FID is identical to that in [7]. When pretraining on ImageNet 256 × 256, we set the batch
size to 1024 and 768 for DiM-Large and DiM-Huge, respectively. When fine-tuning DiM-Huge on
ImageNet 512 × 512, we set the batch size to 240 with gradient accumulation.
Figure 4: The images generated by DiM-Huge with cfg=4.0. (a) ImageNet 512 × 512; (b) ImageNet 256 × 256.
Training and inference setups. On ImageNet, we first pretrain DiM with more than 300K iterations
at 256 × 256 resolution. We then finetune the pretrained model on 512 × 512 resolution. To achieve
a higher resolution without the cost of training, we further use training-free upsampling techniques to
generate 1024 × 1024 and 1536 × 1536 images with DiM-Huge trained on 512 × 512 resolution.
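To give a sense of how sequence length grows across these resolutions, the sketch below counts patch tokens for a square image. The latent downsampling factor of 8 is an assumption about the pretrained image autoencoder and is not stated in this section; padding, timestep, and class tokens are ignored.

```python
def num_tokens(resolution, patch_size=2, latent_downsample=8):
    """Number of patch tokens for a square image.

    latent_downsample=8 is an assumed autoencoder stride; use 1 for
    pixel-space models such as the CIFAR setting.
    """
    side = resolution // latent_downsample // patch_size
    return side * side

counts = {res: num_tokens(res) for res in (256, 512, 1024)}
# Pairwise token interactions, the quantity that self-attention scales with.
pairwise = {res: n * n for res, n in counts.items()}
```

Under these assumptions, doubling the image side multiplies the token count by 4 and the pairwise-interaction count by 16, whereas a linear-time scan grows only with the token count.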
In this subsection, we examine the efficiency of DiM and compare it with the transformer backbone.
A single selective scan is more efficient than FlashAttention V2 [39]. However, to maintain a
similar number of parameters, the standard Mamba has twice as many blocks as the transformer.
These doubled scans increase the computational complexity. Additionally, our proposed modules,
including the switching of scanning patterns, also create slight latency. To compare the practical efficiency of
[Figure: FPS (log scale) for U-ViT, DiM, and the Mamba baseline; annotations: 1.4× faster, 2.2× faster.]
Model trained on ImageNet. As shown in Fig. 4b, we select a set of generated images for
visualization. The results show that DiM-Huge pretrained on ImageNet can generate high-quality
256 × 256 images with a classifier-free guidance weight of 4.0. As shown in Fig. 4a, our model
fine-tuned on ImageNet with 512 × 512 resolution also shows great performance.
Figure 5: The high-resolution images generated by DiM-Huge trained on 512 × 512 images. (a) ImageNet 1024 × 1024; (b) ImageNet 1536 × 1536.
Training-free up-sampling. We can use our model trained on 512 × 512 ImageNet dataset [16] to
directly generate 1024 × 1024 and 1536 × 1536 images. As shown in Fig. 5, even when the resolution
is increased to three times that of training, our model is still able to generate visually appealing
images with upsample-guidance [35].
ImageNet 256 × 256 pretraining. We compare DiM to other transformer-based and SSM-based
diffusion models in Tab. 2. After training on 319 million image samples, DiM-Huge can achieve
a score of 2.40 on FID-50K. With only 63% of the training samples used by U-ViT [7] (319M
versus 500M), our model performs comparably to the other transformer-based diffusion
models, i.e., only about 0.1 worse on FID-50K. When we train the model with 480M image samples,
our model can outperform other models, achieving a score of 2.21 on FID-50K. Moreover, compared
to the DiffuSSM-XL, the Gflops of our Mamba-based diffusion model is much smaller, i.e., DiM
requires fewer resources for inference.
ImageNet 512 × 512 finetuning. Training on 512 × 512 image samples requires significant
computational resources. Also, such a large resolution creates non-negligible latency during training
and inference, as shown in Fig. 3. Therefore, instead of training from scratch, we finetune DiM-Huge
pretrained on ImageNet 256 × 256 for higher resolutions, and report the results in Tab. 3. With 3%
of the 512 × 512 training data of U-ViT [7] (15M versus 500M), DiM-Huge achieves 3.94 FID-50K.
If we further finetune our model with 110K iterations, it can achieve 3.78 FID-50K. The finetuned
DiM-Huge can produce visually appealing 512 × 512 images, as shown in Fig. 4a.
Table 3: The results of various models on ImageNet 512 × 512. Pretrain in this table denotes DiM
trained at 256 × 256 resolution.
Model             Images (Iterations × BatchSize)  Parameters  GFlops  FID
U-ViT-L/4 [7]     300M (300K × 1024)               287M        76      4.67
U-ViT-H/4 [7]     500M (500K × 1024)               501M        133     4.05
DiT [6]           768M                             675M        524     3.04
DiffuSSM-XL [27]  302M                             660M        1066    3.41
DiM-Huge          Pretrain + 15M (64K × 240)       860M        708     3.94
DiM-Huge          Pretrain + 26M (110K × 240)      860M        708     3.78
Table 5: Benchmark of unconditional image generation on CIFAR-10 [15].
Model        Parameters  FID
DDPM [1]     36M         3.17
GenViT [41]  11M         20.20
U-ViT-S [7]  44M         3.11
Ours-S       50M         2.92
Table 6: Ablation studies of the scanning patterns on CIFAR-10 [15]. The circled numbers
correspond to Fig. 2.
Scanning Patterns  FID
①②③④               2.92
①②                 2.91
①                  17.60
Figure 6: The initial 512 × 512 images generated by the model trained on 256 × 256 images without
any additional techniques before finetuning. (a) Model with the scan patterns ①②③④; (b) model with
the scan patterns ①②. Columns in each panel: conditional generation with CFG, conditional generation,
unconditional generation. Comparing (a) and (b), we find DiM with row- and
column-major scans can provide better object structures than the model with only row-major scans.
The circled numbers correspond to Fig. 2.
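The sample counts quoted for fine-tuning follow from iterations times batch size, which the short sketch below cross-checks. The helper name `images_seen` is illustrative only.

```python
def images_seen(iterations, batch_size):
    """Training samples processed = iterations x batch size."""
    return iterations * batch_size

finetune_short = images_seen(64_000, 240)      # reported as 15M
finetune_long = images_seen(110_000, 240)      # reported as 26M
share_of_uvit = finetune_short / 500_000_000   # fraction of U-ViT's reported 500M samples
```

The first product is about 15.4 million images and the second about 26.4 million, and the first corresponds to roughly 3% of the 500M samples reported for U-ViT-H/4.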
CIFAR-10. We compare our method to other diffusion models on CIFAR-10 dataset, and we report
the results in Tab. 5. The results show that our method achieves performance comparable to other
methods with a similar number of parameters.
5 Conclusion
In this paper, we propose Diffusion Mamba (DiM), a new Mamba-based diffusion model backbone,
for efficient high-resolution image generation. In our framework, the sequence model Mamba
is used to process the patch features of the two-dimensional noisy inputs. To adapt Mamba for
two-dimensional data, we propose several approaches, including scan pattern switching, learnable
padding tokens, and lightweight local feature enhancement. Then, to efficiently train the model
with high-resolution image samples, we adopt a “weak-to-strong” training and fine-tuning strategy
for DiM. The experiments demonstrate that our model achieves performance comparable to
other transformer-based diffusion models on high-resolution image generation. We also explore the
training-free upsampling of DiM to generate higher-resolution images without further finetuning. We
also discuss failure cases, limitations, and broader impacts in our appendix.
References
[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1,
3, 6, 9
[2] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR.
OpenReview.net, 2021.
[4] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution
image synthesis with latent diffusion models. In CVPR, pages 10674–10685. IEEE, 2022. 1, 3, 5
[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical
image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241.
Springer, 2015. 1, 2, 3, 5
[6] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 1, 3, 6, 8
[7] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A
vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 22669–22679, 2023. 2, 3, 5, 6, 7, 8, 9
[8] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James
Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic
text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 3
[9] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi,
Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution
image synthesis. arXiv preprint arXiv:2403.03206, 2024. 3
[10] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor,
Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as
world simulators. 2024. 1, 3
[11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint
arXiv:2312.00752, 2023. 1, 3, 5, 6
[12] Albert Gu. Modeling Sequences with Structured State Spaces. Stanford University, 2023. 1, 3, 4
[13] Xilin Jiang, Cong Han, and Nima Mesgarani. Dual-path mamba: Short and long-term bidirectional
selective structured state space models for speech separation. arXiv preprint arXiv:2403.18257, 2024. 1
[14] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo,
Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k
text-to-image generation. arXiv preprint arXiv:2403.04692, 2024. 3, 6
[15] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 3, 9
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255.
Ieee, 2009. 3, 6, 8
[17] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna,
and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv
preprint arXiv:2307.01952, 2023. 3
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017. 3
[19] Peng Gao, Le Zhuo, Ziyi Lin, Chris Liu, Junsong Chen, Ruoyi Du, Enze Xie, Xu Luo, Longtian Qiu,
Yuhang Zhang, et al. Lumina-t2x: Transforming text into any modality, resolution, and duration via
flow-based large diffusion transformers. arXiv preprint arXiv:2405.05945, 2024. 3
[20] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang,
Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang,
Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang,
Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue, Yangyu Tao, Jianchen Zhu,
Kai Liu, Sihuan Lin, Yifu Sun, Yun Li, Dongdong Wang, Mingtao Chen, Zhichao Hu, Xiao Xiao, Yan
Chen, Yuhong Liu, Wei Liu, Di Wang, Yong Yang, Jie Jiang, and Qinglin Lu. Hunyuan-dit: A powerful
multi-resolution diffusion transformer with fine-grained chinese understanding, 2024. 3
[21] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme
Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and
language model. arXiv preprint arXiv:2305.18565, 2023. 3
[22] Albert Gu, Isys Johnson, Aman Timalsina, Atri Rudra, and Christopher Ré. How to train your hippo: State
space models with generalized orthogonal basis projections. arXiv preprint arXiv:2206.12037, 2022. 3, 4
[23] Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of
diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
4
[24] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry
hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022. 3, 4
[25] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NIPS, pages
667–675, 2016. 3, 5
[26] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence
modeling. arXiv preprint arXiv:2208.04933, 2022. 4
[27] Jing Nathan Yan, Jiatao Gu, and Alexander M. Rush. Diffusion models without attention. CoRR,
abs/2311.18257, 2023. 4, 8
[28] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision
mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint
arXiv:2401.09417, 2024. 4
[29] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan
Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
[30] Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, and Yu Qiao. Videomamba: State
space model for efficient video understanding. arXiv preprint arXiv:2403.06977, 2024.
[31] Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu,
and Limin Wang. Video mamba suite: State space model as a versatile alternative for video understanding.
arXiv preprint arXiv:2403.09626, 2024. 4
[32] Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Scalable diffusion models with state
space backbone. arXiv preprint arXiv:2402.05608, 2024. 4, 6
[33] Vincent Tao Hu, Stefan Andreas Baumann, Ming Gui, Olga Grebenkova, Pingchuan Ma, Johannes Fischer,
and Bjorn Ommer. Zigma: Zigzag mamba diffusion model. arXiv preprint arXiv:2403.13802, 2024. 4, 6
[34] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial
networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages
4401–4410, 2019. 4
[35] Juno Hwang, Yong-Hyun Park, and Junghyo Jo. Upsample guidance: Scale up diffusion models without
training. arXiv preprint arXiv:2404.01709, 2024. 6, 8
[36] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
2013. 6
[37] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ODE
solver for diffusion probabilistic model sampling in around 10 steps. In NeurIPS, 2022. 6
[38] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information
processing systems, 30, 2017. 6
[39] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint
arXiv:2307.08691, 2023. 7
[40] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. In ICLR. OpenReview.net,
2021. 9
[41] Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, and Shihao Ji. Your vit is secretly a hybrid
discriminative-generative diffusion model. arXiv preprint arXiv:2208.07791, 2022. 9
A More Results on ImageNet
We present more results of DiM trained on ImageNet with high classifier-free guidance
in Fig. 7, Fig. 8 and Fig. 9. Notably, the classifier-free guidance weight has a great impact on both
visual quality and FID. A larger guidance weight improves the visual quality but worsens FID.
Empirically, the best classifier-free guidance weights for FID are 1.4 and 1.7 for ImageNet 256 × 256
and 512 × 512, respectively. We also present images generated under these settings in Fig. 11
and Fig. 12.
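For readers unfamiliar with how the guidance weight enters sampling, the sketch below shows one common formulation of classifier-free guidance, in which a weight of 1.0 recovers the purely conditional prediction. Whether DiM uses exactly this convention is an assumption; the function name `apply_cfg` and the toy values are hypothetical.

```python
from typing import List, Sequence


def apply_cfg(eps_cond: Sequence[float],
              eps_uncond: Sequence[float],
              weight: float) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Assumed convention: eps = eps_uncond + weight * (eps_cond - eps_uncond),
    so weight == 1.0 returns eps_cond and larger weights extrapolate
    further away from the unconditional prediction.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy predictions (not model outputs), evaluated at the weights in the text.
cond, uncond = [1.0, -0.5], [0.5, 0.0]
for w in (1.0, 1.4, 1.7, 4.0):
    print(w, apply_cfg(cond, uncond, w))
```

Under this convention, moving from a weight of 1.4 or 1.7 to 4.0 pushes the prediction much further from the unconditional one, which fits the trade-off between visual quality and FID described above.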
Figure 7: More 256 × 256 images generated by DiM-Huge pretrained on ImageNet with cfg=4.0.
B Experiments on FFHQ
FFHQ Dataset. We train DiM-Large for unconditional image generation. To verify the image
generation ability of our model on more than 16K tokens (Mamba is faster than Transformer
when the number of tokens is greater than 10K), we set the patch size to 1 × 1 for the FFHQ
1024 × 1024 dataset. The number of tokens is then $\left(\frac{1024}{1 \times 8}\right)^2 = 16{,}384$.
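The token count above can be checked with a short calculation. The sketch below assumes that the factor of 8 in the denominator is the per-side downsampling of the latent autoencoder, which the text does not state explicitly; the function name `num_tokens` and the `downsample` parameter are hypothetical.

```python
def num_tokens(image_size: int, patch_size: int, downsample: int = 8) -> int:
    """Number of tokens for a square image.

    Assumption: the image is first reduced by `downsample` per side
    (taken to be 8, matching the 1x8 denominator in the text), then
    split into non-overlapping patch_size x patch_size patches.
    """
    side, rem = divmod(image_size, patch_size * downsample)
    if rem != 0:
        raise ValueError("image_size must be divisible by patch_size * downsample")
    return side * side


# Token counts at the three FFHQ training resolutions with 1 x 1 patches.
for size in (256, 512, 1024):
    print(size, num_tokens(size, patch_size=1))
```

With 1 × 1 patches, only the 1024 × 1024 setting (16,384 tokens) exceeds the 10K-token threshold mentioned above; 256 × 256 and 512 × 512 give 1,024 and 4,096 tokens.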
Training Details. We first train DiM on FFHQ 256 × 256 with a batch size of 1024 for 50K iterations.
Then, we finetune this pretrained model on FFHQ 512 × 512 for 20K iterations with a batch size
of 256. Finally, we further finetune the model on FFHQ 1024 × 1024 for 50K iterations with a batch
size of 64. Note that the batch size differs at each resolution, so we use a similar number of
iterations for finetuning.
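Because the batch size changes between stages, iteration counts alone do not show how much data each stage processes. The sketch below multiplies the two for each stage reported above; the `Stage` type and its field names are hypothetical, and the totals are simple products, not figures reported by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    resolution: int   # side length of the square training images
    batch_size: int
    iterations: int

    @property
    def images_seen(self) -> int:
        # Total training samples processed in this stage.
        return self.batch_size * self.iterations


# Values copied from the training details in the text.
SCHEDULE = [
    Stage(resolution=256, batch_size=1024, iterations=50_000),
    Stage(resolution=512, batch_size=256, iterations=20_000),
    Stage(resolution=1024, batch_size=64, iterations=50_000),
]

for stage in SCHEDULE:
    print(f"{stage.resolution:>4}: {stage.images_seen / 1e6:.2f}M images")
```

Under these numbers, the two finetuning stages together process roughly a sixth of the images seen in the 256 × 256 stage.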
Qualitative Results. We also show the results of DiM trained on FFHQ 1024 × 1024 with a 1 × 1
patch size. The generated images in Fig. 10 show that DiM is capable of processing
16K patches and generating high-resolution human faces after training.
D Broader Impacts
Image generation has wide applications in assisting users, designers, and artists in creating new
content. However, researchers, developers, and users should also be aware of the potential negative
social impact of image generation models. Such models might be misused to generate misleading
or biased content.
Figure 8: More 512 × 512 images generated by DiM-Huge finetuned on ImageNet with cfg=4.0.
Figure 9: More 1024 × 1024 images generated by DiM-Huge trained with 512 × 512 images.
Figure 10: FFHQ 1024 × 1024
Figure 11: 256 × 256 images generated by DiM-Huge pre-trained on ImageNet with cfg=1.4.
Figure 12: 512 × 512 images generated by DiM-Huge finetuned on ImageNet with cfg=1.7.
(a) Failure cases on humans. (b) Failure cases of training-free upsampling.
Figure 13: Failure cases. (a) At each resolution, the image quality on human subjects is unstable:
generated faces and limbs tend to collapse. (b) The problem of repeating patterns is not
well resolved by training-free upsampling at a resolution of 1536 × 1536. In addition, the background
becomes cluttered, illustrating that DiM trained at a lower resolution still has difficulty processing
details at 3× higher resolution.