
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Anton Razzhigaev1,2, Arseniy Shakhmatov3, Anastasia Maltseva3, Vladimir Arkhipkin3,
Igor Pavlov3, Ilya Ryabov3, Angelina Kuts3, Alexander Panchenko2,1,
Andrey Kuznetsov3,1, and Denis Dimitrov3,1

1AIRI, 2Skoltech, 3Sber AI
{razzhigaev, kuznetsov, dimitrov}@airi.net

arXiv:2310.03502v1 [cs.CV] 5 Oct 2023

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.

1 Introduction

In quite a short period of time, the generative abilities of text-to-image models have improved substantially, providing users with photorealistic quality, near real-time inference speed, and a great number of applications and features, including simple easy-to-use web-based platforms and sophisticated AI graphics editors.

This paper presents our investigation of latent diffusion architecture design, offering a fresh and innovative perspective on this dynamic field of study. First, we describe the new architecture of Kandinsky and its details. The demo system with the implemented features of the model is also described. Second, we report experiments on image generation quality, which show that Kandinsky achieves the best FID score among existing open-source models. Additionally, we present a rigorous ablation study of prior setups, enabling us to carefully analyze and evaluate various configurations and arrive at the most effective and refined model design.

Our contributions are as follows:

• We present the first text-to-image architecture designed using a combination of image prior and latent diffusion.

• We demonstrate experimental results comparable to the state-of-the-art (SotA) models such as Stable Diffusion, IF, and DALL-E 2 in terms of the FID metric, and achieve the SotA score among all existing open-source models.

• We provide a software implementation of the proposed state-of-the-art method for text-to-image generation and release pre-trained models, which is unique among the top-performing methods. The Apache 2.0 license makes it possible to use the model for both non-commercial and commercial purposes.2,3

• We create a web image editor application that can be used for interactive generation of images by text prompts (English and Russian languages are supported) on the basis of the proposed method, and provides inpainting/outpainting functionality.4 A video demonstration is available on YouTube.5

1 The system is named after Wassily Kandinsky, a famous painter and an art theorist.
2 https://github.com/ai-forever/Kandinsky-2
3 https://huggingface.co/kandinsky-community
4 https://fusionbrain.ai/en/editor
5 https://www.youtube.com/watch?v=c7zHPc59cWU
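The released code and checkpoints (footnotes 2 and 3) can also be driven through the Hugging Face diffusers library. The snippet below is a minimal sketch of the two-stage inference flow described in this paper (image prior, then latent diffusion with MoVQ decoding); the repository ids, class names, and arguments reflect the community diffusers integration and are assumptions that may differ across library versions, not the authors' reference implementation.

```python
# Minimal sketch of two-stage Kandinsky inference via Hugging Face diffusers.
# Assumes a GPU and the kandinsky-community checkpoints; exact argument names
# may differ between diffusers versions.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Stage 1: the diffusion image prior maps a text prompt to a CLIP image embedding.
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")

# Stage 2: the latent diffusion U-Net plus MoVQ decoder turn the embedding into pixels.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a corgi gliding on the wave"
image_embeds, negative_image_embeds = prior(prompt).to_tuple()

image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=100,
).images[0]
image.save("corgi.png")
```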
Figure 1: Image prior scheme and inference regimes of the Kandinsky model (image prior training, text-to-image generation, image variations, image fusion, fusion with text, and inpainting).

2 Related Work

Early text-to-image generative models, such as DALL-E (Ramesh et al., 2021) and CogView (Ding et al., 2021), or later Parti (Yu et al., 2022), employed autoregressive approaches but often suffered from significant content-level artifacts. This led to the development of a new breed of models that utilized the diffusion process to enhance image quality. Diffusion-based models, such as DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022b), and Stable Diffusion6, have since become cornerstones in this domain. These models are typically divided into pixel-level (Ramesh et al., 2022; Saharia et al., 2022b) and latent-level (Rombach et al., 2022) approaches.

This surge of interest has led to the design of innovative approaches and architectures, paving the way for numerous applications based on open-source generative models, such as DreamBooth (Ruiz et al., 2023) and DreamPose (Karras et al., 2023). These applications exploit image generation techniques to offer remarkable features, further fueling the popularity and the rapid development of diffusion-based image generation approaches. This enabled a wide array of applications like 3D object synthesis (Poole et al., 2023; Tang et al., 2023; Lin et al., 2022; Chen et al., 2023), video generation (Ho et al., 2022b; Luo et al., 2023; Ho et al., 2022a; Singer et al., 2023; Blattmann et al., 2023; Esser et al., 2023), controllable image editing (Hertz et al., 2023; Parmar et al., 2023; Liew et al., 2022; Mou et al., 2023; Lu et al., 2023), and more, which are now at the forefront of this domain.

Diffusion models achieve state-of-the-art results in image generation, both unconditional (Ho et al., 2020; Nichol and Dhariwal, 2021) and conditional (Peebles and Xie, 2022). They beat GANs (Goodfellow et al., 2014) by generating images with better fidelity and diversity scores without adversarial training (Dhariwal and Nichol, 2021). Diffusion models also show the best performance in various image processing tasks like inpainting, outpainting, and super-resolution (Batzolis et al., 2021; Saharia et al., 2022a).

Text-to-image diffusion models have become a popular research direction due to the high performance of diffusion models and the ability to simply integrate text conditions via the classifier-free guidance algorithm (Ho and Salimans, 2022). Early models like GLIDE (Nichol et al., 2022), Imagen (Saharia et al., 2022b), DALL-E 2 (Ramesh et al., 2022), and eDiff-I (Balaji et al., 2022) generate a low-resolution image in pixel space and then upsample it with additional super-resolution diffusion models. They also use different text encoders: the large language model T5 (Raffel et al., 2020) in Imagen, and CLIP (Radford et al., 2021) in GLIDE and DALL-E 2.

6 https://github.com/CompVis/stable-diffusion

3 Demo System

We implemented a set of user-oriented solutions where the Kandinsky model is embedded as a core imaging service. This was done due to a variety of inference regimes, some of which need specific front-end features to perform properly. Overall, we implemented two main inference resources: a Telegram bot and the FusionBrain website.
Figure 2: Examples of inference regimes using Kandinsky model.

FusionBrain represents a web-based image editor with such features as loading and saving images, a sliding location window, erasing tools, zooming in/out, a selector of various styles, etc. (cf. Figure 3). In terms of image generation, the three following options are implemented on this side:

• text-to-image generation – the user inputs a text prompt in Russian or English, then selects an aspect ratio from the list (9:16, 2:3, 1:1, 16:9, 3:2), and the system generates an image;

• inpainting – using a dedicated erasing tool, the user can remove any arbitrary part of the input image and fill it, guided by a text prompt or without any guidance;

• outpainting – the input image can be extended with a sliding window that is used as a mask for the subsequent generation: if the window intersects an already imaged area, the empty part of the window is generated with or without text prompt guidance (a mask-construction sketch is given below).

Inpainting and outpainting are the main image editing features of the model. Architectural details about these generation types can also be found in Figure 1.
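As an illustration of the outpainting masking just described, here is a minimal sketch that extends the canvas to the right; the layout, file path, and variable names are illustrative and not the editor's front-end code.

```python
# Sketch: build an outpainting canvas and mask for a window that extends
# the original image to the right (illustrative only).
import numpy as np
from PIL import Image

image = Image.open("input.png").convert("RGB")  # placeholder path
w, h = image.size
extend = 256  # how far the sliding window reaches beyond the image

# New canvas: original pixels on the left, empty region on the right.
canvas = np.zeros((h, w + extend, 3), dtype=np.uint8)
canvas[:, :w] = np.array(image)

# Mask: 1 where content must be generated, 0 where the input image is kept.
mask = np.zeros((h, w + extend), dtype=np.uint8)
mask[:, w:] = 1

# The (canvas, mask) pair, optionally with a text prompt, is then passed to
# the inpainting/outpainting generation regime.
```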
The Telegram bot contains the following image generation features (cf. Figure 2):

• text-to-image generation;

• image and text fusion – the user inputs an image and a text prompt to create a new image guided by this prompt;

• image fusion – the user inputs an image as the main one and another ’guiding’ image, and the system generates their fusion;

• image variations – the user inputs an image, and the system generates several new images similar to the input one.

Figure 3: Kandinsky web interface for “a corgi gliding on the wave”: generation (left) and in/outpainting (right).

4 Kandinsky Architecture

In our work, we opted to deliver state-of-the-art text-to-image synthesis. In the initial stages of our research, we experimented with multilingual text encoders, such as mT5 (Xue et al., 2021), XLMR (Conneau et al., 2020), and XLMR-CLIP7, to facilitate robust multilingual text-to-image generation. However, we discovered that using CLIP-image embeddings instead of standalone text encoders resulted in improved image quality. As a result, we adopted an image prior approach, utilizing diffusion and linear mappings between the text and image embedding spaces of CLIP, while keeping additional conditioning with XLMR text embeddings. That is why Kandinsky uses two text encoders: CLIP-text with image prior mapping, and XLMR. Both encoders are kept frozen during the training phase.

The significant factor that influenced our design choice was the efficiency of training latent diffusion models compared to pixel-level diffusion models (Rombach et al., 2022). This led us to focus our efforts on the latent diffusion architecture. Our model essentially comprises three stages: text encoding, embedding mapping (image prior), and latent diffusion.

7 https://github.com/FreddeFrallan/Multilingual-CLIP

Table 1: Proposed architecture comparison by FID on the COCO-30K validation set at 256×256 resolution. *For the IF model we report reproduced results on COCO-30K; the authors provide a FID of 7.19.

Model                              FID-30K
Open-Source Technologies
Kandinsky (Ours)                   8.03
Stable Diffusion 2.1 (2022)8       8.59
GLIDE (Nichol et al., 2022)        12.24
IF* (2023)12                       15.10
Kandinsky 1.0 (2022)9              15.40
ruDALL-E Malevich (2022)9          20.00
GLIGEN10 (Li et al., 2023)         21.04
Proprietary Technologies
eDiff-I (Balaji et al., 2022)      6.95
Imagen (Saharia et al., 2022b)     7.27
GigaGAN (Kang et al., 2023)        9.09
DALL-E 2 (Ramesh et al., 2022)     10.39
DALL-E (Ramesh et al., 2021)       17.89

8 https://github.com/Stability-AI/stablediffusion
9 https://github.com/ai-forever/ru-dalle
10 https://github.com/gligen/GLIGEN
At the embedding mapping step, which we also refer to as the image prior, we use a transformer-encoder model. This model was trained from scratch with a diffusion process on text and image embeddings provided by the CLIP-ViT-L14 model. A noteworthy feature of our training process is the use of element-wise normalization of visual embeddings. This normalization is based on full-dataset statistics and leads to faster convergence of the diffusion process. We implemented inverse normalization to revert to the original CLIP-image embedding space at the inference stage.

The image prior model is trained on text and image embeddings provided by the CLIP models. We conducted a series of experiments and ablation studies on the specific architecture design of the image prior model (Table 3, Figure 6). The model with the best human evaluation score is based on a 1D diffusion process and a standard transformer encoder with the following parameters: num_layers=20, num_heads=32, and hidden_size=2048.
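To make the image prior stage more concrete, below is a simplified PyTorch sketch of the ideas described above: CLIP image embeddings are element-wise normalized with full-dataset statistics, and a transformer encoder denoises them conditioned on the text embedding and the timestep. All module names, the toy noise schedule, and the placeholder data are illustrative assumptions; this is not the released implementation.

```python
# Illustrative sketch of a diffusion image prior (not the released code).
import torch
import torch.nn as nn


class DiffusionImagePrior(nn.Module):
    def __init__(self, clip_dim=768, hidden_size=2048, num_layers=20, num_heads=32):
        super().__init__()
        # Full-dataset statistics used for element-wise (de)normalization.
        self.register_buffer("emb_mean", torch.zeros(clip_dim))
        self.register_buffer("emb_std", torch.ones(clip_dim))
        self.in_proj = nn.Linear(clip_dim, hidden_size)
        self.text_proj = nn.Linear(clip_dim, hidden_size)
        self.time_proj = nn.Linear(1, hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_size, clip_dim)

    def normalize(self, image_emb):
        return (image_emb - self.emb_mean) / self.emb_std

    def denormalize(self, image_emb):
        # Inverse normalization back to the original CLIP-image embedding space.
        return image_emb * self.emb_std + self.emb_mean

    def forward(self, noisy_image_emb, text_emb, t):
        # Tokens: [noisy image embedding, text embedding, timestep]; the first
        # output token is read out as the predicted clean image embedding.
        tokens = torch.stack(
            [self.in_proj(noisy_image_emb),
             self.text_proj(text_emb),
             self.time_proj(t.float().unsqueeze(-1))],
            dim=1,
        )
        return self.out_proj(self.encoder(tokens)[:, 0])


# One training step (simplified DDPM-style objective on embeddings).
# Small config for the demo; the paper's setting is num_layers=20,
# num_heads=32, hidden_size=2048.
prior = DiffusionImagePrior(hidden_size=256, num_layers=2, num_heads=8)
text_emb = torch.randn(4, 768)    # CLIP-text embeddings (placeholder)
image_emb = torch.randn(4, 768)   # CLIP-image embeddings (placeholder)
t = torch.randint(0, 1000, (4,))
alpha = 1.0 - t.float().unsqueeze(-1) / 1000.0   # toy noise schedule
target = prior.normalize(image_emb)
noisy = alpha.sqrt() * target + (1 - alpha).sqrt() * torch.randn_like(target)
loss = nn.functional.mse_loss(prior(noisy, text_emb, t), target)
loss.backward()
```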
The latent diffusion part employs a UNet model along with a custom pre-trained autoencoder. Our diffusion model uses a combination of multiple condition signals: CLIP-image embeddings, CLIP-text embeddings, and XLMR-CLIP text embeddings. The CLIP-image and XLMR-CLIP embeddings are merged and utilized as an input to the latent diffusion process. We also conditioned the diffusion process on these embeddings by adding all of them to the time embedding. Notably, we did not skip the quantization step of the autoencoder during diffusion inference, as it leads to an increase in the diversity and the quality of generated images (cf. Figure 4). In total, our model comprises 3.3B parameters (Table 2).

Table 2: Kandinsky model parameters.

Architecture part               Params   Freeze
Diffusion Mapping               1B       False
CLIP image encoder (ViT-L14)    427M     True
CLIP text encoder               340M     True
Text encoder (XLM-R-L)          560M     True
Latent Diffusion UNet           1.22B    False
MoVQ image autoencoder          67M      True
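The conditioning scheme described above can be sketched as follows: the CLIP-image embedding and the XLMR-CLIP text tokens are merged into the cross-attention context for the U-Net, while pooled projections of all condition embeddings are added to the timestep embedding. The dimensions, module names, and pooling choice are hypothetical, not the released U-Net.

```python
# Illustrative sketch of combining condition signals for the latent diffusion
# U-Net (hypothetical shapes and names, not the released code).
import torch
import torch.nn as nn


class ConditionMixer(nn.Module):
    def __init__(self, clip_dim=768, xlmr_dim=1024, ctx_dim=1024, time_dim=1280):
        super().__init__()
        # Projections into a shared width for the U-Net cross-attention context.
        self.clip_ctx = nn.Linear(clip_dim, ctx_dim)
        self.xlmr_ctx = nn.Linear(xlmr_dim, ctx_dim)
        # Pooled projections that are added to the timestep embedding.
        self.img_to_time = nn.Linear(clip_dim, time_dim)
        self.txt_to_time = nn.Linear(clip_dim, time_dim)
        self.xlmr_to_time = nn.Linear(xlmr_dim, time_dim)

    def forward(self, time_emb, clip_image_emb, clip_text_emb, xlmr_text_tokens):
        # time_emb:         (B, time_dim)      timestep embedding
        # clip_image_emb:   (B, clip_dim)      output of the image prior
        # clip_text_emb:    (B, clip_dim)      pooled CLIP-text embedding
        # xlmr_text_tokens: (B, T, xlmr_dim)   XLMR-CLIP token embeddings
        context = torch.cat(
            [self.clip_ctx(clip_image_emb).unsqueeze(1),
             self.xlmr_ctx(xlmr_text_tokens)],
            dim=1,
        )  # merged cross-attention input for the U-Net
        time_emb = (
            time_emb
            + self.img_to_time(clip_image_emb)
            + self.txt_to_time(clip_text_emb)
            + self.xlmr_to_time(xlmr_text_tokens.mean(dim=1))
        )  # all condition embeddings added to the time embedding
        return time_emb, context


mixer = ConditionMixer()
t_emb, ctx = mixer(torch.randn(2, 1280), torch.randn(2, 768),
                   torch.randn(2, 768), torch.randn(2, 64, 1024))
print(t_emb.shape, ctx.shape)  # torch.Size([2, 1280]) torch.Size([2, 65, 1024])
```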
We observed that image decoding was our main bottleneck in terms of generated image quality; hence, we developed Sber-MoVQGAN, our custom implementation of MoVQGAN (Zheng et al., 2022) with minor modifications. We trained this autoencoder on the LAION HighRes dataset (Schuhmann et al., 2022), obtaining SotA results in image reconstruction. We released the weights and code for these models under an open source licence11. The comparison of our autoencoder with competitors can be found in Table 4.

Table 3: Ablation study: FID on the COCO-30K validation set at 256×256 resolution.

Setup                               FID-30K   CLIP
Diffusion prior with quantization   9.86      0.287
Diffusion prior w/o quantization    9.87      0.286
Linear prior                        8.03      0.261
Residual prior                      8.61      0.249
No prior                            25.92     0.256

Figure 4: CLIP-FID curves for different setups.

5 Experiments
We sought to evaluate and refine the performance of our proposed latent diffusion architecture in our experimental analysis. To this end, we employed automatic metrics, specifically FID-CLIP curves on the COCO-30K dataset, to obtain the optimal guidance-scale value and to compare Kandinsky with competitors (cf. Figure 4). Furthermore, we conducted investigations with various image prior setups, exploring the impact of different configurations on performance. These setups included: no prior, utilizing text embeddings directly; a linear prior, implementing one linear layer; a ResNet prior, consisting of 18 residual MLP blocks; and a transformer diffusion prior.

Figure 5: Image generation results with the prompt "astronaut riding a horse" for the original image prior and a linear prior trained on 500 pairs of images with cats.
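The no-prior, linear, and ResNet-style setups listed above can be summarized with the following illustrative module definitions; block details and names are assumptions, and the transformer diffusion prior corresponds to the model sketched in Section 4.

```python
# Illustrative definitions of the ablated prior setups (assumed details).
import torch
import torch.nn as nn


class LinearPrior(nn.Module):
    """One linear layer from CLIP-text to CLIP-image embedding space."""

    def __init__(self, dim=768):
        super().__init__()
        self.map = nn.Linear(dim, dim)

    def forward(self, text_emb):
        return self.map(text_emb)


class ResidualMLPBlock(nn.Module):
    def __init__(self, dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class ResNetPrior(nn.Module):
    """A stack of 18 residual MLP blocks, as in the ablation."""

    def __init__(self, dim=768, num_blocks=18):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualMLPBlock(dim) for _ in range(num_blocks)])

    def forward(self, text_emb):
        return self.blocks(text_emb)


# "No prior" simply feeds the CLIP-text embedding to the decoder unchanged.
no_prior = nn.Identity()
text_emb = torch.randn(2, 768)
print(LinearPrior()(text_emb).shape, ResNetPrior()(text_emb).shape, no_prior(text_emb).shape)
```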
An essential aspect of our experiments was the exploration of the effect of latent quantization within the MoVQ autoencoder. We examined the outputs with latent quantization both enabled and disabled, to better comprehend its influence on image generation quality.

To ensure a comprehensive evaluation, we also included an assessment of the IF model12, which is the closest open-source competitor to our proposed model. For this purpose, we computed FID scores for the IF model13 (Table 1).

However, we acknowledged the limitations of automatic metrics that become obvious when it comes to capturing user experience nuances. Hence, in addition to the FID-CLIP curves, we conducted a blind human evaluation to obtain insightful feedback and validate the quality of the generated images from the perspective of human perception, based on the DrawBench dataset (Saharia et al., 2022b).

The combination of automatic metrics and human evaluation provides a comprehensive assessment of Kandinsky performance, enabling us to make informed decisions about the effectiveness and usability of our proposed image prior design.

11 https://github.com/ai-forever/MoVQGAN
12 https://github.com/deep-floyd/IF
13 https://github.com/mseitzer/pytorch-fid
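For reference, the automatic metrics above can be reproduced with standard tooling: FID with the pytorch-fid package (footnote 13) on folders of generated and reference COCO images, and the CLIP score as the prompt-image cosine similarity. The sketch below uses the Hugging Face transformers CLIP implementation; the file paths and guidance-scale naming are placeholders, not the authors' evaluation scripts.

```python
# Sketch: one point of a CLIP-FID curve for a given guidance scale.
# FID (lower is better) is computed with the pytorch-fid CLI, e.g.:
#   python -m pytorch_fid coco30k_reference/ generated_gs_4.0/
# CLIP score (higher is better) is the prompt-image cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


@torch.no_grad()
def clip_score(prompts, image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=prompts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb * image_emb).sum(dim=-1).mean().item()


# Placeholder lists; in practice these are the COCO-30K prompts and the images
# generated at a particular guidance scale.
print(clip_score(["astronaut riding a horse"], ["generated_gs_4.0/000000.png"]))
```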
6 Results

Our experiments and evaluations have showcased the capabilities of the Kandinsky architecture in text-to-image synthesis. Kandinsky achieved a FID score of 8.03 on the COCO-30K validation set at a resolution of 256×256, which puts it in close competition with the state-of-the-art models and among the top performers within open-source systems. Our methodical ablation studies further dissected the performance of different configurations: quantization of latent codes in MoVQ slightly improves the quality of images (FID 9.86 vs 9.87), and the best CLIP score and human-eval score are obtained by the diffusion prior.
Figure 6: Human evaluation: competitors vs Kandinsky with diffusion prior on DrawBench. The total count of votes is 5000.

Table 4: Sber-MoVQGAN comparison with competitors on the ImageNet dataset.

Model Latent size Num Z Train steps FID ↓ SSIM ↑ PSNR ↑ L1 ↓


ViT-VQGAN* 32x32 8192 500,000 1.28 - - -
RQ-VAE* 8x8x16 16384 10 epochs 1.83 - - -
Mo-VQGAN* 16x16x4 1024 40 epochs 1.12 0.673 22.42 -
VQ CompVis 32x32 16384 971,043 1.34 0.650 23.85 0.0533
KL CompVis 32x32 - 246,803 0.968 0.692 25.11 0.0474
Sber-VQGAN 32x32 8192 1 epoch 1.44 0.682 24.31 0.0503
Sber-MoVQGAN 67M 32x32 1024 5,000,000 1.34 0.704 25.68 0.0451
Sber-MoVQGAN 67M 32x32 16384 2,000,000 0.965 0.725 26.45 0.0415
Sber-MoVQGAN 102M 32x32 16384 2,360,000 0.776 0.737 26.89 0.0398
Sber-MoVQGAN 270M 32x32 16384 1,330,000 0.686 0.741 27.04 0.0393

The best FID score is achieved using the linear prior. This configuration stands out with the best FID score of 8.03. It is an intriguing outcome: the simplest linear mapping showcased the best FID, suggesting that there might exist a linear relationship between the visual and textual embedding vector spaces. To further scrutinize this hypothesis, we trained a linear mapping on a subset of 500 cat images and termed it the "cat prior". Astonishingly, this mapping displayed high proficiency (cf. Figure 5).
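The "cat prior" experiment above amounts to fitting a single linear map from CLIP-text to CLIP-image embeddings on a small set of paired examples. A minimal sketch of such a fit, with random placeholder embeddings standing in for real CLIP features, is shown below.

```python
# Sketch: fit a linear "prior" from CLIP-text to CLIP-image embeddings
# on a small paired subset (e.g. ~500 cat image-caption pairs).
# Embeddings here are random placeholders; real ones would come from CLIP.
import torch
import torch.nn as nn

text_embs = torch.randn(500, 768)    # CLIP-text embeddings of the captions
image_embs = torch.randn(500, 768)   # CLIP-image embeddings of the images

linear_prior = nn.Linear(768, 768)
opt = torch.optim.Adam(linear_prior.parameters(), lr=1e-3)

for step in range(1000):
    opt.zero_grad()
    pred = linear_prior(text_embs)
    loss = nn.functional.mse_loss(pred, image_embs)  # MSE between embedding spaces
    loss.backward()
    opt.step()

# At inference, the predicted image embedding replaces the diffusion prior's
# output and is passed to the latent diffusion decoder.
predicted_image_emb = linear_prior(torch.randn(1, 768))
```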
7 Conclusion

We presented Kandinsky, a system for various image generation and processing tasks based on a novel latent diffusion model. Our model yielded SotA results among open-sourced systems. Additionally, we provided an extensive ablation study of image prior design choices. Our system is equipped with free-to-use interfaces in the form of a Web application and a Telegram messenger bot. The pre-trained models are available on Hugging Face, and the source code is released under a permissive license enabling various, including commercial, applications of the developed technology.

In future research, our goal is to investigate the potential of the latest image encoders. We plan to explore the development of more efficient UNet architectures for text-to-image tasks and focus on improving the understanding of textual prompts. Additionally, we aim to experiment with generating images at higher resolutions and to investigate new features extending the model: local image editing by a text prompt, attention reweighting, physics-based generation control, etc. The robustness against generating abusive content remains a crucial concern, warranting the exploration of real-time moderation layers or robust classifiers to mitigate undesirable, e.g. toxic or abusive, outputs.

8 Limitations

The current system produces images that appear natural; however, additional research can be conducted to (1) enhance the semantic coherence between the input text and the generated image, and (2) improve the absolute values of FID and image quality based on human evaluations.
9 Ethical Considerations

We performed multiple efforts to ensure that the generated images do not contain harmful, offensive, or abusive content by (1) cleansing the training dataset of samples that were marked as harmful/offensive/abusive, and (2) detecting abusive textual prompts.

While obvious queries, according to our tests, almost never generate abusive content, technically it is not guaranteed that certain carefully engineered prompts will not yield undesirable content. We therefore recommend using an additional layer of classifiers, depending on the application, to filter out the undesired content, and/or using image/representation transformation methods tailored to a given application.

Acknowledgements

As usual, we would like to thank the anonymous reviewers for their useful comments. We would also like to thank Sergey Markov and his team for helpful feedback and discussions, and for collaboration in multimodal dataset collection, labelling, and processing.
References

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. 2022. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers.

Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb, and Christian Etmann. 2021. Conditional image generation with score-based diffusion models. CoRR, abs/2111.13606.

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. CoRR, abs/2304.08818.

Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. 2023. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. CoRR, abs/2303.13873.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics.

Prafulla Dhariwal and Alexander Quinn Nichol. 2021. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 8780–8794.

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. 2021. CogView: Mastering text-to-image generation via transformers. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 19822–19835.

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. 2023. Structure and content-guided video synthesis with diffusion models. CoRR, abs/2302.03011.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680.

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. 2022a. Imagen Video: High definition video generation with diffusion models. CoRR, abs/2210.02303.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. CoRR, abs/2207.12598.

Jonathan Ho, Tim Salimans, Alexey A. Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022b. Video diffusion models. In NeurIPS.

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. 2023. Scaling up GANs for text-to-image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 10124–10134. IEEE.

Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. 2023. DreamPose: Fashion image-to-video synthesis via stable diffusion.

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 2023. GLIGEN: Open-set grounded text-to-image generation. CoRR, abs/2301.07093.

Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. 2022. MagicMix: Semantic mixing with diffusion models. CoRR, abs/2210.16056.

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2022. Magic3D: High-resolution text-to-3D content creation. CoRR, abs/2211.10440.

Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. 2023. TF-ICON: Diffusion-based training-free cross-domain image composition. CoRR, abs/2307.12493.

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. 2023. VideoFusion: Decomposed diffusion models for high-quality video generation. CoRR, abs/2303.08320.

Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. 2023. DragonDiffusion: Enabling drag-style manipulation on diffusion models. CoRR, abs/2307.02421.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 16784–16804. PMLR.

Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. 2023. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH 2023, Los Angeles, CA, USA, August 6-10, 2023, pages 11:1–11:11. ACM.

William Peebles and Saining Xie. 2022. Scalable diffusion models with transformers. CoRR, abs/2212.09748.

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE.

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 22500–22510. IEEE.

Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. 2022a. Palette: Image-to-image diffusion models. In SIGGRAPH '22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Vancouver, BC, Canada, August 7-11, 2022, pages 15:1–15:10. ACM.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022b. Photorealistic text-to-image diffusion models with deep language understanding.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS.

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. 2023. Make-A-Video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. 2023. Make-It-3D: High-fidelity 3D creation from a single image with diffusion prior. CoRR, abs/2303.14184.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 483–498. Association for Computational Linguistics.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022. Scaling autoregressive models for content-rich text-to-image generation. Trans. Mach. Learn. Res., 2022.

Chuanxia Zheng, Tung-Long Vuong, Jianfei Cai, and Dinh Phung. 2022. MoVQ: Modulating quantized vectors for high-fidelity image generation. In NeurIPS.
A Additional generation examples
